# spaCy

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x25a84623cd0>

## Detecting Email Addresses

Suppose you have a text document containing details of various employees.

What if you want the employees' email addresses so you can send them all a common email?

You can tokenize the document and check which tokens are emails through the **like_email** attribute. **like_email** returns **True** if the token looks like an email address.

In [3]:
# Text containing employee details 

employees_text = """ name: Karthik age: 45 email: karthik45@gmail.com
                     name: dinesh age: 40 email: dinesh40@gmail.com
                     name: Ganesh age: 67 email: ganeshan67@gmail.com
                     name: Kamlesh age: 27 email: kamlesh27@gmail.com
                     name: Kaushal age: 28 email: kaushal28@gmail.com
                     name: Vikram age: 45 email: Vikram45@gmail.com
                     name: Venkat age: 45 email: Venkat45@gmail.com"""

# Creating a spacy doc
employees_doc = nlp(employees_text)

In [6]:
# Printing the tokens that are emails, using the like_email attribute

for token in employees_doc:
    if token.like_email:
        print(token.text)

karthik45@gmail.com
dinesh40@gmail.com
ganeshan67@gmail.com
kamlesh27@gmail.com
kaushal28@gmail.com
Vikram45@gmail.com
Venkat45@gmail.com
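To actually send a common email, you would usually want the addresses in a list rather than printed one by one. The following is a minimal sketch of that step. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes, since `like_email` is a lexical attribute and does not need a trained model. The names `sample_text` and `recipients` are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because like_email is computed from the token text alone.
nlp = spacy.blank("en")

# Hypothetical sample in the same style as employees_text, with one repeat.
sample_text = (
    "name: Karthik email: karthik45@gmail.com "
    "name: Dinesh email: dinesh40@gmail.com "
    "name: Karthik email: karthik45@gmail.com"
)
doc = nlp(sample_text)

# dict.fromkeys drops duplicate addresses while keeping first-seen order.
recipients = list(dict.fromkeys(t.text for t in doc if t.like_email))
print(recipients)
```

The resulting list can then be passed to whatever mailing tool you use.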


Likewise, spaCy provides a variety of token attributes. Below is a list of some of these attributes and what they check:

(-) token.is_alpha: Returns **True** if the token consists of alphabetic characters

(-) token.is_ascii: Returns **True** if the token consists of ASCII characters

(-) token.is_digit: Returns **True** if the token consists of digits (0-9)

(-) token.is_upper: Returns **True** if the token is in upper case

(-) token.is_lower: Returns **True** if the token is in lower case

(-) token.is_space: Returns **True** if the token consists of whitespace

(-) token.is_bracket: Returns **True** if the token is a bracket

(-) token.is_quote: Returns **True** if the token is a quotation mark

(-) token.like_url: Returns **True** if the token resembles a URL (a link to a website)
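The attributes listed above can be tried out on a handful of hand-picked tokens. This is only a sketch: it assumes spaCy v3, builds the `Doc` directly from a word list (so the tokenizer is bypassed), and uses a blank English pipeline because these flags depend only on the token text. The word list and the `flags` dictionary are illustrative names, not part of the notebook.

```python
import spacy
from spacy.tokens import Doc

# Assumption: blank English pipeline; its vocab carries the lexical
# attribute functions (is_alpha, like_url, ...) without a trained model.
nlp = spacy.blank("en")

# Hand-picked tokens, one per kind of attribute we want to see fire.
words = ["Hello", "NASA", "42", "(", '"', " ", "https://spacy.io", "info@example.com"]
doc = Doc(nlp.vocab, words=words)

attrs = ["is_alpha", "is_digit", "is_upper", "is_space",
         "is_bracket", "is_quote", "like_url", "like_email"]

# For each token, collect the names of the attributes that are True.
flags = {token.text: [a for a in attrs if getattr(token, a)] for token in doc}

for text, true_attrs in flags.items():
    print(repr(text), true_attrs)
```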

## Part of Speech analysis with spaCy

Consider the sentence **"Emily likes playing football"**.

Here, Emily is a noun and playing is a verb. Likewise, every word in a text belongs to a category such as noun, pronoun, verb, or conjunction. These categories are called part-of-speech (POS) tags.

How do you identify the part of speech of the words in a text document?

spaCy stores it in each token's **pos_** attribute.

In [7]:
# POS tagging using spaCy

my_text = "John plays basketball, if time permit. He played in the high school too."

my_doc = nlp(my_text)

for token in my_doc:
    print(token.text, "------", token.pos_)

John ------ PROPN
plays ------ VERB
basketball ------ NOUN
, ------ PUNCT
if ------ SCONJ
time ------ NOUN
permit ------ VERB
. ------ PUNCT
He ------ PRON
played ------ VERB
in ------ ADP
the ------ DET
high ------ ADJ
school ------ NOUN
too ------ ADV
. ------ PUNCT


You can see the POS tag next to each word, such as VERB or ADJ.

What if you don't know what a tag such as SCONJ means?

The **spacy.explain()** function returns a short description of a tag, which in this case is its full form.

In [11]:
#spacy.explain('SCONJ')
spacy.explain('ADP')

'adposition'
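To look up several tags at once, `spacy.explain()` can be called in a loop. The sketch below uses a few of the tags that appear in the outputs of this notebook; the `tags` list and the `explanations` dictionary are illustrative names.

```python
import spacy

# A few tags seen in the outputs above; spacy.explain needs no loaded model.
tags = ["PROPN", "SCONJ", "ADP", "AUX", "X"]

# Map each tag to its description from spaCy's built-in glossary.
explanations = {tag: spacy.explain(tag) for tag in tags}

for tag, meaning in explanations.items():
    print(f"{tag:<6}{meaning}")
```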

## POS tagging helps you deal with text-based problems

Suppose you have a text document of reviews or comments on a post. Alongside meaningful words, it will contain junk such as "etc" that carries no meaning. How can you remove it?

spaCy assigns the POS tag X (other) to tokens that do not fit any other category, so you can find junk tokens with **token.pos_ == 'X'** and filter them out.

In [12]:
# Raw text document

raw_text = """I liked the movies etc The movie had good direction The movie The movie was average direction was not bad The cinematography The movie was a bit lengthy otherwise fantastic etc etc"""

raw_doc = nlp(raw_text)

# Printing every token whose POS tag is X
print('The junk values are ..')

for token in raw_doc:
    if token.pos_ == 'X':
        print(token.text)
        
print('After removing junk')

# Keeping only the tokens whose POS tag is not X

clean_doc = [token for token in raw_doc if token.pos_ != 'X']
print(clean_doc)

The junk values are ..
etc
etc
etc
After removing junk
[I, liked, the, movies, The, movie, had, good, direction, The, movie, The, movie, was, average, direction, was, not, bad, The, cinematography, The, movie, was, a, bit, lengthy, otherwise, fantastic]
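Note that the filtered result is a Python list of tokens, not a string. If you need cleaned text back, one option is to join the surviving tokens. The sketch below assumes spaCy v3 and assigns the POS tags by hand through the `Doc` constructor so that it runs without a trained model; in the notebook you would iterate over `raw_doc` instead. The names `words`, `pos`, and `clean_text` are illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: tags are set manually here to keep the example model-free.
# With en_core_web_sm loaded, nlp(raw_text) would predict them instead.
words = ["I", "liked", "the", "movies", "etc"]
pos = ["PRON", "VERB", "DET", "NOUN", "X"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# text_with_ws keeps each token's trailing whitespace, so joining the
# kept tokens rebuilds readable text without the junk.
clean_text = "".join(t.text_with_ws for t in doc if t.pos_ != "X").strip()
print(clean_text)
```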


**You can also see which POS tags are present in your text by building a dictionary**

In [13]:
# Creating a dictionary that maps each POS tag ID (token.pos) to its name (token.pos_)

all_tags = {token.pos:token.pos_ for token in raw_doc}
print(all_tags)

{95: 'PRON', 100: 'VERB', 90: 'DET', 92: 'NOUN', 101: 'X', 84: 'ADJ', 87: 'AUX', 94: 'PART', 86: 'ADV'}
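The dictionary above tells you which tags occur, but not how often. If the frequencies matter, `collections.Counter` is one way to get them. As a sketch under the same assumptions as before (spaCy v3, POS tags assigned by hand so no trained model is needed), with `pos_counts` as an illustrative name:

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-assigned tags stand in for a model's predictions.
words = ["The", "movie", "had", "good", "direction", "etc", "etc"]
pos = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "X", "X"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Count how many tokens carry each POS tag.
pos_counts = Counter(token.pos_ for token in doc)
print(pos_counts.most_common())
```

With the notebook's own data, replacing `doc` with `raw_doc` would count the tags predicted by the loaded model.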


In [14]:
# Importing displacy
from spacy import displacy
my_text = "She never like playing, reading was her hobby"
my_doc = nlp(my_text)

# Displaying the dependency parse, with each token's POS tag shown beneath it
displacy.render(my_doc, style='dep', jupyter=True)