# spaCy

The text used here is the same as the text used for the other tokenizer modules, which keeps the demo uniform. Let's begin by installing the spaCy library:

`pip3 install spacy`

In [1]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks."

In [2]:
import spacy

Load the model for your desired language (English in this case), call it on the text to create a `Doc` object, and then iterate over the tokens of that object.

Remember to download the language model before loading it. Download whichever of the models below fits your requirements:

```
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
python3 -m spacy download en_core_web_lg
```

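In spaCy v2.x you can also use the `en` shortcut shown below. It downloads only the default small model (`en_core_web_sm`), not all three, and spaCy v3 deprecates shortcuts, so prefer the full model names above:

`python3 -m spacy download en`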

In [3]:
nlp = spacy.load('en_core_web_sm')
tokens = nlp(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.']


## Defining Tokenizer with Default Settings

In [4]:
from spacy.lang.en import English

In [5]:
nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.']


## Defining Tokenizer for Words in English Vocab

In [6]:
from spacy.tokenizer import Tokenizer

In [7]:
tokenizer = Tokenizer(nlp.vocab)
tokens = tokenizer(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', '\n\n', 'Thanks.']
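
A `Tokenizer` built from the vocab alone has no prefix, suffix, or infix rules, which is why punctuation stays attached to words in the output above. As a rough illustration only, the standard-library sketch below reproduces that result for this particular text. The helper name `whitespace_tokens` and its regular expression are assumptions made for this example; they are not how spaCy is implemented, and they may differ from spaCy on other inputs (for example, runs of several spaces).

```python
import re

text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks."

def whitespace_tokens(s):
    # Hypothetical helper: runs of non-whitespace become tokens, and runs of
    # whitespace other than a plain space (here "\n\n") are kept as tokens.
    return re.findall(r"\S+|[^\S ]+", s)

print(whitespace_tokens(text))
```

For the demo text this prints the same list as the cell above, with `'$3.88.'`, `'them.'` and `'Thanks.'` left unsplit.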


## Defining Special Rules

In some cases, we need to customize the tokenization rules. For example, here we break the `\n\n` token into two separate `\n` tokens. This is done by adding a special case to the tokenizer object, using the `ORTH` attribute to specify the exact text of each resulting token.

In [8]:
from spacy.symbols import ORTH

In [9]:
nlp = spacy.load('en_core_web_sm')

special_case = [{ORTH: '\n'}, {ORTH: '\n'}]
nlp.tokenizer.add_special_case('\n\n', special_case)

tokens = nlp(text)
print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n', '\n', 'Thanks', '.']


## Customizing Tokenizer

In [10]:
import re

In [11]:
special_cases = {'\n\n': [{ORTH: '\n\n'}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

In [12]:
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules = special_cases,
    prefix_search = prefix_re.search,
    suffix_search = suffix_re.search,
    infix_finditer = infix_re.finditer,
    url_match = simple_url_re.match
)

In [13]:
tokens = nlp(text)
print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', '\n\n', 'Thanks.']
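
The custom patterns above only cover brackets, quotes, hyphens and tildes, so `$3.88.` and `them.` are left whole. To see what the patterns do match, the sketch below applies them to single whitespace-delimited chunks using only the `re` module. It is a simplified approximation of the tokenizer's affix-splitting loop, not spaCy's actual implementation: the function name `split_chunk`, the sample strings, and the `example.com` URL are assumptions made for illustration, and special cases are ignored.

```python
import re

# Same patterns as in the cell above.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def split_chunk(chunk):
    """Toy approximation: split one whitespace-delimited chunk into pieces."""
    prefixes, suffixes = [], []
    # Peel prefixes and suffixes off one character at a time.
    while chunk:
        m = prefix_re.search(chunk)
        if m:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = suffix_re.search(chunk)
        if m:
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
            continue
        break
    middle = []
    if simple_url_re.match(chunk):
        # Whatever remains looks like a URL: keep it as a single piece.
        middle.append(chunk)
    else:
        # Otherwise split around every infix match, keeping the infix itself.
        pos = 0
        for m in infix_re.finditer(chunk):
            if m.start() > pos:
                middle.append(chunk[pos:m.start()])
            middle.append(m.group())
            pos = m.end()
        if pos < len(chunk):
            middle.append(chunk[pos:])
    return prefixes + middle + suffixes

print(split_chunk('("well-known")'))
print(split_chunk('$3.88.'))
print(split_chunk('(https://example.com/a-b)'))
```

In this sketch the first call splits off both brackets, both quotes and the hyphen, the second returns the chunk unchanged, and the third keeps the URL whole (hyphen included) while still separating the surrounding parentheses.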
