# Introduction to [spaCy](https://spacy.io/)
*Industrial-Strength Natural Language Processing in Python*

- spaCy is a Python library for NLP
- supports multiple languages and statistical models
- provides support for tokenization, word vectors, tagging, parsing, segmentation, and more

Setup Resources:
- [spacy101](https://spacy.io/usage/spacy-101) 
- [Introduction to NLP with spaCy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad)

To install, open a terminal and run:
```
pip install -U spacy
```
After installation, you also need to download the language model:
```
python -m spacy download en_core_web_lg
```

To use spaCy with English:
```
import spacy
nlp = spacy.load("en_core_web_lg")
```
Make sure you run the installation in the terminal first, before trying to install from within this Jupyter notebook.
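Before loading the model, it can help to check whether both packages are visible to the interpreter that runs the notebook. The sketch below assumes the model was installed with `spacy download`, which installs it as a regular Python package named `en_core_web_lg`. It uses only the standard library, so it runs even when spaCy is missing. The variable names `has_spacy` and `has_model` are illustrative.

```python
import importlib.util

# find_spec returns None when a top-level package cannot be found,
# so these checks never raise ImportError.
has_spacy = importlib.util.find_spec("spacy") is not None
has_model = importlib.util.find_spec("en_core_web_lg") is not None

print("spaCy installed:", has_spacy)
print("en_core_web_lg installed:", has_model)
```

If either line prints `False`, rerun the matching install command from the terminal.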

In [1]:
%%capture
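# Install spaCy from inside the Jupyter notebook.
import subprocess
import sys

def pipmain(args):
    # Run pip through the kernel's own interpreter; pip's internal main() is not a supported API.
    return subprocess.check_call([sys.executable, '-m', 'pip'] + args)

packages = ['spacy']
pipmain(['install'] + packages);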

In [2]:
%%capture
!python -m spacy download en_core_web_lg

In [3]:
import spacy
nlp = spacy.load('en_core_web_lg')

### Tokenization
- split text into words, symbols, punctuation a.k.a. tokens

In [4]:
doc = nlp("The hungry, hungry catepillar ate all of the food, and then he became a butterfly!")
doc.text.split() 

['The',
 'hungry,',
 'hungry',
 'catepillar',
 'ate',
 'all',
 'of',
 'the',
 'food,',
 'and',
 'then',
 'he',
 'became',
 'a',
 'butterfly!']

Note that some of the punctuation gets attached to the previous word. We don't want that.

In [5]:
[token.orth_ for token in doc] 

['The',
 'hungry',
 ',',
 'hungry',
 'catepillar',
 'ate',
 'all',
 'of',
 'the',
 'food',
 ',',
 'and',
 'then',
 'he',
 'became',
 'a',
 'butterfly',
 '!']
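To see why whitespace splitting and tokenization differ, here is a rough standard-library approximation. It is not spaCy's algorithm, which relies on language-specific prefix, suffix, infix, and exception rules. The regular expression and the sample sentence are assumptions made for illustration only.

```python
import re

sample = "The hungry cat ate all of the food, and then he slept!"

# Splitting on whitespace keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# A run of word characters OR a single non-space, non-word character.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The second list separates `,` and `!` into their own tokens, which is the behaviour we want from a tokenizer.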

Remove punctuation by using `.is_punct`.   
Remove spaces by using `.is_space`.   
Remove stop words by using `.is_stop`.   

In [6]:
[token.orth_ for token in doc if not (token.is_punct or token.is_space or token.is_stop)] 

['hungry', 'hungry', 'catepillar', 'ate', 'food', 'butterfly']

Note how all the punctuation, whitespace, and stop words have been removed, leaving only the "important" words.

<b>Aside:</b> In the example below, spaCy splits the possessive `Jessica's` into `Jessica` and `'s`. Try using `nltk`'s `casual_tokenize` to split words instead.

In [7]:
text2 = "Hey!!! Find Jessica's website at https://www.google.com/"
doc2 = nlp(text2)
print(doc2.text.split())
[token.orth_ for token in doc2] 

['Hey!!!', 'Find', "Jessica's", 'website', 'at', 'https://www.google.com/']


['Hey',
 '!',
 '!',
 '!',
 'Find',
 'Jessica',
 "'s",
 'website',
 'at',
 'https://www.google.com/']

In [8]:
%%capture
packages = ['nltk']
pipmain(['install'] + packages);

from nltk.tokenize import casual_tokenize

In [9]:
casual_tokenize(text2)

['Hey',
 '!',
 '!',
 '!',
 'Find',
 "Jessica's",
 'website',
 'at',
 'https://www.google.com/']

### Stopwords
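Stopwords are very common words (such as "the" or "of") that carry little meaning on their own. Because they dominate word counts, they can distort frequency analysis, so it's useful to remove them.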
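A small sketch of the effect on frequency counts, using only the standard library. The stop list `TOY_STOP_WORDS` and the token list are hypothetical and hand-picked for illustration. spaCy's own `STOP_WORDS` set is much larger.

```python
from collections import Counter

# Hypothetical mini stop list; an assumption for this example only.
TOY_STOP_WORDS = {"the", "of", "and", "a", "he", "then", "all"}

toy_tokens = ["the", "hungry", "hungry", "cat", "ate", "all", "of", "the",
              "food", "and", "then", "he", "became", "a", "lion"]

# Counts over every token versus counts over non-stop-word tokens.
raw_counts = Counter(toy_tokens)
content_counts = Counter(t for t in toy_tokens if t not in TOY_STOP_WORDS)

print(raw_counts.most_common(2))
print(content_counts.most_common(1))
```

In the raw counts, "the" ties with "hungry" for the top spot. After filtering, only content words remain.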

In [10]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

### Lemmatization
- reducing a word to its base form or root form
- reduces the various word forms to their citation form

Use spaCy's `.lemma_` attribute.

In [11]:
lemma_words = "going gone went goes" 
nlp_lemma_words = nlp(lemma_words) 
[word.lemma_ for word in nlp_lemma_words] 

['go', 'go', 'go', 'go']

This is especially useful for text classification: lemmatising the text helps avoid duplicate entries for the same word when building models such as bag of words.
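The effect on a bag-of-words vocabulary can be sketched without a model. Here `TOY_LEMMAS` is a hypothetical lookup table standing in for `.lemma_`. A real lemmatizer covers far more forms and also uses context.

```python
from collections import Counter

# Hypothetical lemma lookup; an assumption that stands in for token.lemma_.
TOY_LEMMAS = {"going": "go", "gone": "go", "went": "go", "goes": "go"}

words = ["going", "gone", "went", "goes", "home"]

# Bag of words over surface forms: every inflection is its own feature.
surface_bag = Counter(words)

# Bag of words over lemmas: the inflections collapse into one feature.
lemma_bag = Counter(TOY_LEMMAS.get(w, w) for w in words)

print(len(surface_bag), "features before lemmatising")
print(len(lemma_bag), "features after lemmatising")
```

Five surface forms shrink to two features, `go` and `home`, so the model sees one count for the verb instead of four separate ones.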

### Parts-of-speech (POS) Tagging
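- assign a part-of-speech tag to each word
- spaCy uses [Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for the fine-grained `.tag_`; `.pos_` holds the coarse-grained Universal POS tag

Use the `.pos_` and `.tag_` attributes.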

In [12]:
doc2 = nlp("My dog's toy actually belongs to the neighbor's cat.") 
pos_tags = [(i, i.tag_) for i in doc2]
pos_tags

[(My, 'PRP$'),
 (dog, 'NN'),
 ('s, 'POS'),
 (toy, 'NN'),
 (actually, 'RB'),
 (belongs, 'VBZ'),
 (to, 'IN'),
 (the, 'DT'),
 (neighbor, 'NN'),
 ('s, 'POS'),
 (cat, 'NN'),
 (., '.')]

Create a list of owner-possession tuples:

In [13]:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"] 

[(dog, toy), (neighbor, cat)]
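The same neighbour lookup can be written over plain `(text, tag)` pairs, which makes the logic easy to follow without a loaded model. The list `tagged` copies the tagger output shown above. The explicit boundary check is an addition for this sketch, so that a `POS` tag at the very start or end of the list is skipped.

```python
# (text, tag) pairs copied from the tagging output above.
tagged = [("My", "PRP$"), ("dog", "NN"), ("'s", "POS"), ("toy", "NN"),
          ("actually", "RB"), ("belongs", "VBZ"), ("to", "IN"), ("the", "DT"),
          ("neighbor", "NN"), ("'s", "POS"), ("cat", "NN"), (".", ".")]

# For every possessive marker, pair the word before it (owner)
# with the word after it (possession).
owner_possession = [
    (tagged[i - 1][0], tagged[i + 1][0])
    for i, (_, tag) in enumerate(tagged)
    if tag == "POS" and 0 < i < len(tagged) - 1
]

print(owner_possession)
```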

### Word Vectors
- the idea behind word embeddings is that every word can be represented as a vector of real numbers that captures the word's meaning and context
- each word has a unique embedding
- word embeddings are multidimensional
- words with similar meanings have similar embedding values

Resources:
- [spacy.io: Word Vectors and Semantic Similarity](https://spacy.io/usage/vectors-similarity)
- [Get Busy With Word Embeddings](https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/)
- [Word Embeddings in Python with Spacy and Gensim](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/)

spaCy provides pre-trained word embeddings, which were downloaded along with the English model. spaCy can parse entire blocks of text and assign word vectors using the loaded model. Then, use `.vector` to get the word vector. 

Important Note: spaCy's small models (models that end in `sm`) don't ship with word vectors. You can still use `.similarity` to compare tokens, but the results won't be as good. To use real word vectors, make sure to download one of the larger models:
```
python -m spacy download en_core_web_lg
```

In [14]:
tokens = nlp(u"cat dog water cloud")
print(tokens[0].text, tokens[0].vector)

cat [-0.15067   -0.024468  -0.23368   -0.23378   -0.18382    0.32711
 -0.22084   -0.28777    0.12759    1.1656    -0.64163   -0.098455
 -0.62397    0.010431  -0.25653    0.31799    0.037779   1.1904
 -0.17714   -0.2595    -0.31461    0.038825  -0.15713   -0.13484
  0.36936   -0.30562   -0.40619   -0.38965    0.3686     0.013963
 -0.6895     0.004066  -0.1367     0.32564    0.24688   -0.14011
  0.53889   -0.80441   -0.1777    -0.12922    0.16303    0.14917
 -0.068429  -0.33922    0.18495   -0.082544  -0.46892    0.39581
 -0.13742   -0.35132    0.22223   -0.144     -0.048287   0.3379
 -0.31916    0.20526    0.098624  -0.23877    0.045338   0.43941
  0.030385  -0.013821  -0.093273  -0.18178    0.19438   -0.3782
  0.70144    0.16236    0.0059111  0.024898  -0.13613   -0.11425
 -0.31598   -0.14209    0.028194   0.5419    -0.42413   -0.599
  0.24976   -0.27003    0.14964    0.29287   -0.31281    0.16543
 -0.21045   -0.4408     1.2174     0.51236    0.56209    0.14131
  0.092514   0.71396   -

Now we can use the word vectors we got from spacy to compare the similarity of the words using `.similarity`.

In [15]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

cat cat 1.0
cat dog 0.8016855
cat water 0.2888436
cat cloud 0.16586679
dog cat 0.8016855
dog dog 1.0
dog water 0.30933863
dog cloud 0.1380703
water cat 0.2888436
water dog 0.30933863
water water 1.0
water cloud 0.3084506
cloud cat 0.16586679
cloud dog 0.1380703
cloud water 0.3084506
cloud cloud 1.0
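By default, these scores are cosine similarities between the word vectors, which is why a word compared with itself gives 1.0 and the table is symmetric. A minimal sketch of the calculation follows. The three-dimensional vectors in `toy_vectors` are invented for illustration and are not spaCy's real embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "cloud": [0.1, 0.2, 0.9],
}

for w1 in toy_vectors:
    for w2 in toy_vectors:
        print(w1, w2, round(cosine_similarity(toy_vectors[w1], toy_vectors[w2]), 3))
```

With these toy values, "cat" and "dog" point in nearly the same direction and score close to 1, while "cat" and "cloud" score much lower.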


### Token Matching
Token matching uses the [Matcher](https://spacy.io/api/matcher), which matches sequences of tokens against pattern rules.    
Resources:
- [Rule-based Matching with spaCy](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68)

In [16]:
from spacy.matcher import Matcher
from spacy.tokens import Span

matcher = Matcher(nlp.vocab)

Define a pattern and add it to the matcher. 

`LOWER` matches on the lowercase form of the token text, so the pattern ignores casing. 
- [Token Attributes](https://spacy.io/api/token#attributes)
- [Available Tokens for Rule based matching](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)
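A minimal sketch of a `LOWER`-only pattern. It assumes spaCy is installed and uses `spacy.blank("en")`, which provides just the English tokenizer, so no model download is needed. Attributes such as `POS` would not be available in this setup. The names `blank_nlp`, `demo_matcher`, and the pattern key `"ML"` are illustrative, and the list form of `add` assumes a recent spaCy version.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for patterns on text attributes like LOWER.
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)

# One dict per token: "machine" followed by "learning", in any casing.
demo_pattern = [{"LOWER": "machine"}, {"LOWER": "learning"}]
demo_matcher.add("ML", [demo_pattern])

demo_doc = blank_nlp("Machine learning is fun. I study machine learning.")

# Each match is (match_id, start, end); slice the Doc to get the matched span.
found = [(start, end, demo_doc[start:end].text)
         for _, start, end in demo_matcher(demo_doc)]
print(found)
```

Both the capitalised and the lowercase occurrence are found, because the pattern compares lowercase forms.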

In [17]:
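# Define the pattern: the noun "computer" (any casing) followed by any token that is not a verb.
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
           {'POS': {'NOT_IN': ['VERB']}}]
# Add the pattern to the previously created matcher object.
# spaCy v3 expects a list of patterns; the older v2 form was matcher.add("Matching", None, pattern).
matcher.add("Matching", [pattern])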

Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. There can be numerous paths to a solution and the computer programmer seeks to design and code that which is most efficient. Among the programmer’s tasks are understanding requirements, determining the right programming language to use, designing or architecting the solution, coding, testing, debugging and writing documentation so that the solution can be easily understood by other programmers.Computer programming is at the heart of computer science. It is the implementation portion of software development, application development and software engineering efforts, transforming ideas and theories into actual, working solutions.

In [18]:
text = "Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. There can be numerous paths to a solution and the computer programmer seeks to design and code that which is most efficient. Among the programmer’s tasks are understanding requirements, determining the right programming language to use, designing or architecting the solution, coding, testing, debugging and writing documentation so that the solution can be easily understood by other programmers.Computer programming is at the heart of computer science. It is the implementation portion of software development, application development and software engineering efforts, transforming ideas and theories into actual, working solutions."
doc = nlp(text)
# Run the matcher over the document; each match is a (match_id, start, end) tuple.
matches = matcher(doc)

In [19]:
for match_id, start, end in matches:
    # nlp.vocab.strings[match_id] would give the pattern name ("Matching")
    span = doc[start:end] 
    print("Indexes:", start, end, span.text)

Indexes: 0 2 Computer programming
Indexes: 45 47 computer programming
Indexes: 75 77 computer programmer
Indexes: 131 133 Computer programming
Indexes: 138 140 computer science


### Phrase Matcher
This allows us to match specific phrases and combinations of words.

In [20]:
from spacy.matcher import PhraseMatcher

In [21]:
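# Match on the lowercase form so that casing differences are ignored.
matcher2 = PhraseMatcher(nlp.vocab, attr='LOWER')
# The list containing the phrases to be matched.
terminology_list = ["Machine Learning", "Hidden Structure", "Unlabeled Data"]
# nlp.make_doc only tokenizes, which is all the phrase patterns need.
patterns = [nlp.make_doc(term) for term in terminology_list]
# Add the patterns to the matcher object without any callbacks.
# spaCy v3 expects a list of patterns; the older v2 form was matcher2.add("Phrase Matching", None, *patterns).
matcher2.add("Phrase Matching", patterns)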

Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly. In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures from unlabeled data. Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. The systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train it / learn from it. Otherwise, acquiring unlabeled data generally doesn’t require additional resources.

In [22]:
# The input text string is converted to a Doc object.
doc2 = nlp("Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly. In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures from unlabeled data. Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. The systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train it / learn from it. Otherwise, acquiring unlabeled data generally doesn’t require additional resources.")
# Calling the matcher on the Doc returns (match_id, start, end) tuples for the matched phrases.
matches2 = matcher2(doc2)
# Print each match, slicing the same Doc that was matched (doc2, not doc).
for match_id, start, end in matches2:
    span = doc2[start:end] 
    print("Indexes:", start, end, span.text)

Indexes: 1 3 machine learning
Indexes: 93 95 machine learning
Indexes: 122 124 hidden structure
Indexes: 125 127 unlabeled data
Indexes: 154 156 unlabeled data
Indexes: 160 162 machine learning
Indexes: 178 180 unlabeled data
Indexes: 195 197 unlabeled data
Indexes: 243 245 unlabeled data


Other Resources
- [PythonForLinguistsTalk Gitlab](https://gitlab.com/andersonh/PythonForLinguistsTalk)
- [Complete Guide to spaCy](https://nlpforhackers.io/complete-guide-to-spacy/)
- [Natural Language Processing With spaCy in Python](https://realpython.com/natural-language-processing-spacy-python/)