# spaCy

In spaCy, tokenization (segmenting a text into words and punctuation) happens in several steps. The tokenizer processes the text from left to right.

- First, the tokenizer splits the text on whitespace, similar to Python's `split()` method.
- Then it checks whether each substring matches a tokenizer exception rule. For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
- Next, it checks the substring for a prefix, suffix, or infix, such as a comma, period, hyphen, or quote. If one matches, the substring is split into two tokens.

    <p align="center">
      <img src="./../assets/tokenization/spacy.jpg"><br>
      <a href="https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/"><i>[Source]</i></a>
    </p>
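
The steps above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the function name `simple_tokenize`, the tiny exception table, and the two regular expressions are hypothetical stand-ins for spaCy's much larger language-specific rule sets, and infix handling is left out.

```python
import re

# Hypothetical, heavily simplified rules for illustration only.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX_RE = re.compile(r'''^[\[\("'$]''')
SUFFIX_RE = re.compile(r'''[\]\)"'.,!?]$''')


def simple_tokenize(text):
    tokens = []
    for chunk in text.split():  # step 1: split on whitespace
        prefixes, suffixes = [], []
        while chunk:
            if chunk in EXCEPTIONS:  # step 2: exception rules take priority
                prefixes.extend(EXCEPTIONS[chunk])
                chunk = ""
                break
            match = PREFIX_RE.search(chunk)  # step 3a: peel off one prefix
            if match:
                prefixes.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX_RE.search(chunk)  # step 3b: peel off one suffix
            if match:
                suffixes.append(match.group())
                chunk = chunk[:match.start()]
                continue
            break  # nothing left to peel off
        if chunk:
            prefixes.append(chunk)
        # Suffixes were collected from the outside in, so reverse them.
        tokens.extend(prefixes + suffixes[::-1])
    return tokens


print(simple_tokenize('He said "don\'t" in the U.K.'))
# ['He', 'said', '"', 'do', "n't", '"', 'in', 'the', 'U.K.']
```

Checking the exception table again after every prefix or suffix is removed is what lets `"don't"` be recognized once its surrounding quotes are gone.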

The text used here is the same as the one used for the other tokenizer modules, to keep the demo uniform. Let's begin by installing the spaCy library:

`pip3 install spacy`

Load the model for your desired language (English in this case), call it on the text to create a `Doc` object, and then iterate over the tokens of that object.

Remember to download the language model before loading it. Choose one of the models below, depending on your requirements:

```
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
python3 -m spacy download en_core_web_lg
```

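Older spaCy v2 releases also accepted the shortcut below, which downloads only the default small model (`en_core_web_sm`), not all three. spaCy v3 deprecates such shortcuts in favor of the full model names above:

`python3 -m spacy download en`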

In [1]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

In [2]:
import spacy

nlp = spacy.load('en_core_web_sm')
tokens = nlp(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.', '🙂', '😍']


### Defining Tokenizer with Default Settings

In [3]:
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.', '🙂', '😍']
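
To see which rule produced each token, the tokenizer offers a debugging method, `explain` (available in recent spaCy releases). The sketch below assumes spaCy is installed. No trained model is needed, because `English()` already carries the default tokenizer rules.

```python
from spacy.lang.en import English

text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

nlp = English()

# explain() returns (rule_or_pattern_name, token_text) pairs,
# for example a PREFIX, SUFFIX, or plain TOKEN label per piece.
explained = nlp.tokenizer.explain(text)

for rule, token_text in explained:
    print(f"{rule:<12}{token_text!r}")
```

Whitespace tokens such as `'\n\n'` are not reported by `explain`, so its output can be shorter than the token list printed above.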


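### Defining Tokenizer for Words in English Vocab

A `Tokenizer` created from the vocab alone has no exception, prefix, suffix, or infix rules, so it only splits on whitespace. That is why `'$3.88.'` and `'Thanks.🙂😍'` stay in one piece in the output below.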

In [4]:
from spacy.tokenizer import Tokenizer

tokenizer = Tokenizer(nlp.vocab)
tokens = tokenizer(text)

print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', '\n\n', 'Thanks.🙂😍']


### Defining Special Rules

In some cases, we need to customize the tokenization rules. For example, we can break a `\n\n` token into two separate `\n` tokens by adding a special rule to the tokenizer object with the help of the `ORTH` attribute.

In [5]:
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')

special_case = [{ORTH: '\n'}, {ORTH: '\n'}]
nlp.tokenizer.add_special_case('\n\n', special_case)

tokens = nlp(text)
print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n', '\n', 'Thanks', '.', '🙂', '😍']
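
The same mechanism works for ordinary words. The following sketch uses the hypothetical input `"gimme that"` and assumes spaCy is installed. The `ORTH` values of a special case must concatenate back to the original string.

```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()

before = [token.text for token in nlp("gimme that")]

# "gim" + "me" must add up to exactly "gimme".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [token.text for token in nlp("gimme that")]

print(before)  # ['gimme', 'that']
print(after)   # ['gim', 'me', 'that']
```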


### Customizing Tokenizer

In [6]:
import re

special_cases = {'\n\n': [{ORTH: '\n\n'}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')
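
Before wiring these patterns into a tokenizer, it can help to check in isolation what each one matches. The sketch below repeats the four patterns so that it runs on its own. The helper `first` and the sample strings are illustrative assumptions.

```python
import re

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')


def first(match):
    """Return the matched text, or None when nothing matched."""
    return match.group() if match else None


print(first(prefix_re.search('(hello')))   # (
print(first(prefix_re.search('$3.88.')))   # None: '$' is not a prefix here
print(first(suffix_re.search('hello"')))   # "
print(first(suffix_re.search('them.')))    # None: '.' is not a suffix here
print([m.group() for m in infix_re.finditer('state-of-the-art')])  # ['-', '-', '-']
print(simple_url_re.match('https://spacy.io') is not None)  # True
```

Because neither `$` nor `.` is covered by these patterns, a tokenizer built from them leaves strings such as `'$3.88.'` and `'them.'` unsplit.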

In [7]:
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules = special_cases,
    prefix_search = prefix_re.search,
    suffix_search = suffix_re.search,
    infix_finditer = infix_re.finditer,
    url_match = simple_url_re.match
)

tokens = nlp(text)
print([token.text for token in tokens])

['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', '\n\n', 'Thanks.🙂😍']
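
Building a `Tokenizer` from scratch discards spaCy's default rules. An alternative is to extend the defaults instead. The sketch below assumes spaCy is installed and uses a hypothetical extra infix rule (splitting on underscores) purely as an illustration. It relies on `nlp.Defaults.infixes` and `spacy.util.compile_infix_regex`.

```python
import spacy
from spacy.lang.en import English

nlp = English()

# Keep the default infix patterns and append one extra (illustrative) pattern.
infixes = list(nlp.Defaults.infixes) + [r"_"]
infix_re = spacy.util.compile_infix_regex(infixes)

# Only the infix rule is swapped; exceptions, prefixes and suffixes stay default.
nlp.tokenizer.infix_finditer = infix_re.finditer

tokens = [token.text for token in nlp("snake_case")]
print(tokens)
```

Prefixes and suffixes can be extended the same way with `spacy.util.compile_prefix_regex` and `spacy.util.compile_suffix_regex`.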
