# spaCy's Language Models                                                        

In [1]:
import spacy

Basic preprocessing with a language model

In [4]:
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only works in spaCy v2; v3 needs the full package name

In [5]:
doc = nlp('This is a sentence')

### Tokenizing text
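
Calling `nlp` on a string splits it into `Token` objects. The sketch below is an illustration, not part of the original notebook. It assumes spaCy is installed and uses `spacy.blank('en')`, which builds a tokenizer-only English pipeline, so no downloaded model is needed. The name `nlp_blank` is hypothetical.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show tokenization.
nlp_blank = spacy.blank('en')

doc = nlp_blank('This is a sentence')

# Iterating over a Doc yields Token objects; .text is the original string of each token.
tokens = [token.text for token in doc]
print(tokens)
```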

## Part-of-speech (POS) tagging

In [6]:
doc = nlp('Jon and I wne tto the park')

for token in doc:
    print((token.text, token.pos_))

('Jon', 'PROPN')
('and', 'CCONJ')
('I', 'PRON')
('wne', 'VERB')
('tto', 'NOUN')
('the', 'DET')
('park', 'NOUN')
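
The tag names in the output are abbreviations. As a small aid (not part of the original notebook), `spacy.explain` looks up a short description for a label. It reads from spaCy's built-in glossary, so this sketch should not need a loaded model.

```python
import spacy

# Tags taken from the POS output above.
tags = ['PROPN', 'CCONJ', 'PRON', 'VERB', 'NOUN', 'DET']

# spacy.explain returns a human-readable description, or None for unknown labels.
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(tag, '->', description)
```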


## Named entity recognition
- a named entity is a real-world object that is assigned a name


In [8]:
doc = nlp('Microsoft has offices all over Europe')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Microsoft 0 9 ORG
Europe 31 37 LOC


spaCy has the following built-in entity types:
- PERSON: People, including fictional ones
- NORP: Nationalities or religious or political groups
- FAC: Buildings, airports, highways, bridges, and so on
- ORG: Companies, agencies, institutions, and so on
- GPE: Countries, cities, and states
- LOC: Non-GPE locations, mountain ranges, and bodies of water
- PRODUCT: Objects, vehicles, foods, and so on (not services)
- EVENT: Named hurricanes, battles, wars, sports events, and so on
- WORK_OF_ART: Titles of books, songs, and so on
- LAW: Named documents made into laws
- LANGUAGE: Any named language

## Rule-based matching

spaCy's `Matcher` finds token sequences that fit a pattern. Patterns can test these token attributes:

- ORTH: The exact verbatim text of a token
- LOWER, UPPER: The lowercase and uppercase form of the token
- IS_ALPHA: Token text consists of alphabetic characters
- IS_ASCII: Token text consists of ASCII characters
- IS_DIGIT: Token text consists of digits
- IS_LOWER, IS_UPPER, IS_TITLE: Token text is in lowercase, uppercase, or title case
- IS_PUNCT, IS_SPACE, IS_STOP: Token is punctuation, whitespace, or a stop word
- LIKE_NUM, LIKE_URL, LIKE_EMAIL: Token text resembles a number, URL, or email address
- POS, TAG: The token's simple and extended POS tag
- DEP, LEMMA, SHAPE: The token's dependency label, lemma, and shape
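
The attributes above are used as keys in `Matcher` patterns, with one dictionary per token. A minimal sketch, assuming spaCy v3 (where `Matcher.add` takes a list of patterns) and a tokenizer-only pipeline. The rule name `HELLO_WORLD` and the example text are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3 API; a blank pipeline is enough for LOWER and IS_PUNCT.
nlp_blank = spacy.blank('en')
matcher = Matcher(nlp_blank.vocab)

# One dict per token: "hello" (any casing), then a punctuation mark, then "world".
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HELLO_WORLD', [pattern])  # hypothetical rule name

doc = nlp_blank('Hello, world! Hello world!')

# Each match is (match_id, start, end), where start/end are token offsets.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Only the first "Hello, world" matches, because the second occurrence has no punctuation token between the two words. Attributes such as POS, TAG, DEP, and LEMMA need a trained pipeline rather than a blank one.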



## Preprocessing
- get rid of information that won't be used later
- in text mining and NLP, such words are called **stop words**
- ex: *of, the, want, to, have, ...*
- spaCy marks them with the `IS_STOP` attribute (`token.is_stop`)

### We can add our own stop words

In [10]:
my_stop_words = [u'say', u'be', u'said', u'says', u'saying', 'field']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
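
To check that the flag takes effect, compare a token's `is_stop` before and after setting it on the lexeme. This is a sketch under the assumption of a blank English pipeline (hypothetical name `nlp_blank`), which carries the default stop-word list but no trained components.

```python
import spacy

# Assumption: a blank English pipeline; it includes the default stop-word list.
nlp_blank = spacy.blank('en')

# 'field' is not a default stop word, so the flag starts out False.
before = nlp_blank('the field')[1].is_stop

# Setting the flag on the vocabulary entry affects every later token with that text.
nlp_blank.vocab['field'].is_stop = True
after = nlp_blank('the field')[1].is_stop

print(before, after)
```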

In [11]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

{'that', 'although', 'twenty', 'your', 'get', 'elsewhere', 'make', 'put', 'most', 'name', 'whatever', 'thereby', 'through', 'against', 'whereafter', 'next', 'an', 'before', 'but', 'to', 'with', 'other', 'herein', 'namely', 'itself', 'thus', 'another', 'cannot', 'if', 'me', 'who', 'both', 'yourselves', 'for', 'as', 'by', 'everywhere', 'might', 'beyond', 'ca', 'too', 'besides', 'seem', 'almost', 'upon', 'due', 'sometimes', 'somewhere', 'perhaps', 'various', 'last', 'towards', 'beside', 'also', 'am', 'him', 'throughout', 'anyway', 'my', 'go', 'there', 'top', 'already', 'of', 'however', 'mostly', 'because', 'himself', 'about', 'its', 'two', 'can', 'us', 'seems', 'when', 'wherever', 'sixty', 'under', 'call', 'on', 'yet', 'hers', 'yours', 'behind', 'or', 'former', 'while', 'not', 'somehow', 'thence', 'really', 'several', 'thereupon', 'was', 'whenever', 'hereafter', 'never', 'will', 'just', 'serious', 'whose', 'yourself', 'still', 'along', 'fifteen', 'otherwise', 'using', 'his', 'after', 'hen

```python
# Add to STOP_WORDS: .add() takes one word, .update() takes an iterable of words

STOP_WORDS.add('your_word')
STOP_WORDS.update(my_stop_words)

```

In [12]:
# Example: Clean up a sentence

doc = nlp(u'the horse galloped down the field and past the river.')
sentence = []

for w in doc:
    # keep the token only if it is not a newline, stop word, punctuation mark, or number
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        # add the lemmatized version of the word
        sentence.append(w.lemma_)
print(sentence)

['horse', 'gallop', 'past', 'river']


Filtering on the `.is_stop`, `.is_punct`, and `.like_num` attributes removes the parts of the sentence we don't need.
- the lemmatized form of a word can be accessed through `.lemma_`
- further words can be removed depending on the use case

## Summary         

spaCy makes it easy to annotate text data. With a language model, each document is annotated with a lot of information: not just the tokens and whether each one is a stop word, but also part-of-speech tags, named entities, and so on. We can also train these annotation models ourselves, which gives the language model and processing pipeline a lot of power. Downloading the models and using virtual environments are also an important part of this process. Next, we move on to representing our cleaned data in a form that machines can work with, namely vectors, and look at which Python libraries we need for that.