## Basic preprocessing with language models
```python
doc = nlp('this is a sentence.')
```

When you call nlp on Unicode text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several steps, which together we refer to as the pipeline.


![](https://i.imgur.com/b5eDIoW.png)


## Tokenizing text


Different languages have different tokenization rules. Let's look at an example of how tokenization might work in English. The sentence "Let us go to the park." is quite straightforward, and would be broken up as follows, with the appropriate numerical indices:

![](https://i.imgur.com/ZfABw3t.png)
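
The same breakdown can be reproduced in code. This is a minimal sketch, assuming spaCy is installed. It uses spacy.blank('en'), a tokenizer-only English pipeline, which is an assumption made here so that no trained model has to be downloaded. The name nlp_blank is illustrative only.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because rule-based tokenization does not depend on a trained model.
nlp_blank = spacy.blank('en')
doc = nlp_blank('Let us go to the park.')

# token.i is the token's index within the Doc
tokens = [(token.i, token.text) for token in doc]
for index, text in tokens:
    print(index, text)
```

Note that the final period becomes a token of its own, separate from "park".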




![](https://i.imgur.com/4rPNwm1.png)


## Part of speech (POS) tagging


The second component of the default pipeline we described before was the tensorizer.

A tensorizer encodes the internal representation of the doc as an array of floats. This is a necessary step because spaCy's models are neural network models and only speak tensors: every Doc object is expected to be tensorized. As users, we do not need to concern ourselves with this. After this step, we start with our first annotation: part-of-speech tagging.

In the first chapter, we briefly mentioned POS-tagging as marking each token of the sentence with its appropriate part of speech, such as noun, verb, and so on. spaCy uses a statistical model to perform its POS-tagging. To get the annotation from a token, we simply look up the pos_ attribute on the token.

Consider this example:

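```python
import spacy

# 'en' is the spaCy 2.x shortcut name; spaCy 3 and later need the full
# package name instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')
doc = nlp('John and I went to the park')

# print each token with its coarse-grained part-of-speech tag
for token in doc:
    print((token.text, token.pos_))
```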


## Named entity recognition

We now have the last part of our pipeline, where we perform named entity recognition. A named entity is a real-world object that is assigned a name, for example a person, a country, a product, or an organization. spaCy can recognize various types of named entities in a document by asking the model for a prediction. Since models are statistical and depend on the examples they were trained on, they don't always work perfectly and might need some tuning later, depending on your use case; we have a chapter saved up just to better understand named entity recognition and how to train our own models.

Named entities are available as the ents property of a Doc:

```python
doc = nlp('Microsoft has offices all over Europe.')
# each entity span carries its text, character offsets, and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

```
Microsoft 0 9 ORG
Europe 31 37 LOC
```


spaCy has the following built-in entity types:

- PERSON: People, including fictional ones
- NORP: Nationalities or religious or political groups
- FACILITY: Buildings, airports, highways, bridges, and so on
- ORG: Companies, agencies, institutions, and so on
- GPE: Countries, cities, and states
- LOC: Non GPE locations, mountain ranges, and bodies of water
- PRODUCT: Objects, vehicles, foods, and so on (not services)
- EVENT: Named hurricanes, battles, wars, sports events, and so on
- WORK_OF_ART: Titles of books, songs, and so on
- LAW: Named documents made into laws
- LANGUAGE: Any named language
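
To look up what a label means without leaving the notebook, spacy.explain returns a short description for many labels. A small sketch follows; the labels passed in are just examples, the name descriptions is illustrative, and the wording of the returned descriptions may differ between spaCy versions.

```python
import spacy

# spacy.explain looks a label up in spaCy's built-in glossary and
# returns None when the label is unknown
labels = ['ORG', 'GPE', 'LOC', 'NORP']
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, '->', description)
```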

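## Rule based matching

spaCy's rule-based matcher describes each token in a pattern by its attributes. The available token attributes include: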


- ORTH: The exact verbatim text of a token
- LOWER, UPPER: The lowercase and uppercase form of the token
- IS_ALPHA: Token text consists of alphabetic characters
- IS_ASCII: Token text consists of ASCII characters
- IS_DIGIT: Token text consists of digits
- IS_LOWER, IS_UPPER, IS_TITLE: Token text is in lowercase, uppercase, or title case
- IS_PUNCT, IS_SPACE, IS_STOP: Token is punctuation, whitespace, or a stop word
- LIKE_NUM, LIKE_URL, LIKE_EMAIL: Token text resembles a number, URL, or email address
- POS, TAG: The token's simple and extended POS tag
- DEP, LEMMA, SHAPE: The token's dependency label, lemma, and shape
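
These attributes are used as keys in token patterns. The sketch below is one possible illustration, not taken from the original notebook: it assumes a blank English pipeline (spacy.blank('en')) and the spaCy 3 form of Matcher.add, and the pattern name 'HelloWorld' and the example sentence are made up for the demonstration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, since LOWER and IS_PUNCT
# are lexical attributes that need no trained model.
nlp_blank = spacy.blank('en')
matcher = Matcher(nlp_blank.vocab)

# one dict per token: "hello", then any punctuation mark, then "world"
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
# add(name, [patterns]) is the spaCy 3 form; older 2.x releases used
# matcher.add('HelloWorld', None, pattern)
matcher.add('HelloWorld', [pattern])

doc = nlp_blank('Hello, world! Hello world.')
# each match is a (match_id, start, end) triple of token indices
matched_spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_spans)
```

Only the first occurrence matches, because the second has no punctuation token between the two words.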

## Preprocessing

The wonderful thing about preprocessing text is that it almost feels intuitive: we get rid of any information which we think won't be used in our final output and keep what we feel is important. Here, our information is words, and some words do not always provide useful insights. In the text mining and natural language processing community, these words are called stop words [22].

Stop words are words that are filtered out of our text before we run any text mining or NLP algorithms on it. Again, we would like to draw attention to the fact that this is not done in every case: if we intend to find stylistic similarities or understand how writers use stop words, we would obviously need to keep the stop words!

There is no universal stop words list for each language, and it largely depends on the use case and what kind of results we expect to be seeing. Usually, it is a list of the most common words in the language, such as of, the, want, to, and have.

With spaCy, stop words are very easy to identify: each token has an is_stop attribute, which lets us know if the word is a stop word or not. The list of all the stop words for each language can be found in the spacy/lang [20] folder.

We can also add our own stop words to the list of stop words. For example:


```python
# flag extra, domain-specific words as stop words in the vocabulary
my_stop_words = ['say', 'be', 'said', 'says', 'saying', 'field']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
```
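
To check that the flag took effect, we can read is_stop back from the tokens of a sentence. This is a sketch under stated assumptions: it uses a separate blank English pipeline (spacy.blank('en')) so that it runs without a downloaded model, and the names nlp_blank and flags are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline, which still carries the default
# English stop word list.
nlp_blank = spacy.blank('en')

# flag one extra word through its vocabulary entry
nlp_blank.vocab['field'].is_stop = True

doc = nlp_blank('the horse galloped down the field')
# map each token's text to its stop word flag
flags = {token.text: token.is_stop for token in doc}
print(flags)
```

Here "the" is flagged by the default list and "field" by our own addition, while content words such as "horse" are left alone.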

```python
from spacy.lang.en.stop_words import STOP_WORDS

# the default English stop words are stored as a plain Python set
print(STOP_WORDS)
```

{'whom', 'almost', 'so', 'were', 'first', 'ca', 'against', 'each', 'ever', 'herself', 'several', 'something', 'twelve', 'see', 'nobody', 'also', 'but', 'elsewhere', 'keep', 'less', 'enough', 'mine', 'thus', 'namely', 'yourself', 'else', 'twenty', 'around', 'did', 'whoever', 'what', 'whose', 'former', 'fifty', 'used', 'beyond', 'six', 'all', 'being', 'on', 'three', 'nowhere', 'really', 'you', 'within', 'after', 'another', 'just', 'top', 'it', 'anywhere', 'front', 'amount', 'many', 'move', 'name', 'towards', 'together', 'because', 'if', 'thereby', 'have', 'get', 'whole', 'before', 'however', 'my', 'somewhere', 'both', 'whereupon', 'per', 'seemed', 'again', 'an', 'go', 'between', 'and', 'us', 'of', 'her', 'via', 'when', 'no', 'though', 'due', 'fifteen', 'someone', 'besides', 'am', 'never', 'yet', 'why', 'does', 'once', 'along', 'about', 'nevertheless', 'thence', 'or', 'was', 'upon', 'become', 'full', 'own', 'everything', 'from', 'various', 'at', 'whence', 'very', 'cannot', 'becoming', 'ev

```python
STOP_WORDS.add('your_additional_stop_words_here')
```

#### Clean up a sentence

```python
doc = nlp('the horse galloped down the field and past the river.')
sentence = []
for w in doc:
    # keep the token only if it is not a newline, stop word, punctuation mark, or number
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        # we add the lemmatized version of the word
        sentence.append(w.lemma_)
print(sentence)
```

```
['horse', 'gallop', 'past', 'river']
```
