# Introduction to

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" height=150>


SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.


### SpaCy Models
***

SpaCy currently offers statistical models for a variety of languages, which can be installed as Python packages.

These models are designed to balance speed and accuracy.

To get started, we simply need to import SpaCy and load one of the SpaCy models.

In [0]:
import spacy

# Import the small English model package (en_core_web_sm).
import en_core_web_sm

### NLP 
***
At the center of SpaCy is an object containing the processing pipeline. We usually call this variable "nlp".


It contains language-specific rules for tokenizing the text into words and punctuation.

In [0]:
# Create the nlp object by loading in the model
nlp = en_core_web_sm.load()
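
Model packages can also be loaded by name with `spacy.load`. The sketch below assumes the `en_core_web_sm` package used above; the helper name `load_english` and the fallback to a tokenizer-only pipeline are illustrative choices, not part of SpaCy itself.

```python
import spacy


def load_english(name="en_core_web_sm"):
    """Load an installed model package by name (hypothetical helper)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package is not installed.
        # Fall back to a blank English pipeline: tokenizer only, with no
        # tagger, parser or entity recognizer.
        return spacy.blank("en")


nlp_by_name = load_english()

# pipe_names lists the components that run on each text.
print(nlp_by_name.lang, nlp_by_name.pipe_names)
```

If the fallback is used, tokenization still works, but part-of-speech tags, dependencies and named entities will not be predicted.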

### Doc 
***
When you process text with the nlp object, SpaCy creates a "doc" object - short for document.


The doc object is a container for linguistic annotations and it lets you access information about the text in a structured way.

In [5]:

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
print(doc.text)

Apple is looking at buying U.K. startup for $1 billion.


### Tokenization
***

During processing, SpaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [4]:

for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
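
Because tokenization is rule-based, it can be tried without a statistical model. The sketch below assumes `spacy.blank("en")`, which builds an English pipeline containing only the tokenizer; the variable names are illustrative. `token.i` is the token's position in the Doc and `token.idx` is its character offset in the original text.

```python
import spacy

# A blank pipeline has the English tokenization rules but no trained components.
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

for token in blank_doc:
    # i = index of the token in the doc, idx = character offset in the text
    print(token.i, token.idx, token.text)

# A Doc behaves like a sequence: it has a length and supports indexing and slicing.
print(len(blank_doc), blank_doc[0].text, blank_doc[5:7].text)

tokens = [token.text for token in blank_doc]
```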


### Linguistic annotations
***

After tokenization, SpaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables SpaCy to make a prediction of which tag or label most likely applies in this context.

Linguistic annotations are available as Token attributes. Some describe the token's surface form, for example whether it consists of alphanumeric characters, is punctuation, or resembles a number. Others give insight into the text's grammatical structure, such as the part of speech and the base form of each word.



In [6]:
# Display the part-of-speech tag for each token in the document
for token in doc:
    print(token.text, token.pos_)

Apple PROPN
is VERB
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM
. PUNCT
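
The tag names are abbreviations. As a small aid, `spacy.explain` returns a short description for many tag and label names (and `None` for names it does not know). The list of tags below is simply copied from the output above.

```python
import spacy

pos_tags = ["PROPN", "VERB", "ADP", "NOUN", "SYM", "NUM", "PUNCT"]

# Map each tag to its glossary description.
descriptions = {tag: spacy.explain(tag) for tag in pos_tags}

for tag, description in descriptions.items():
    print(tag, "->", description)
```

The same function can be used on dependency labels such as `nsubj` and entity labels such as `GPE`.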


## Challenges:
***

Using the above example of getting a token's part of speech, complete the following challenges.

The list of SpaCy token attributes can be found here - https://spacy.io/api/token#attributes

1. Check to see if any of the tokens consist of digits.

2. Check to see if any of the tokens are alphanumeric characters.

3. Find the base-form (hint: *lemma*) for each of the tokens.


In [23]:
# Check each token and report whether it consists entirely of digits.
# token.is_digit is equivalent to token.text.isdigit().

for token in doc:
    if token.is_digit:
        print('%s contains a digit' % token)
    else:
        print('%s does not contain a digit' % token)

Apple does not contain a digit
is does not contain a digit
looking does not contain a digit
at does not contain a digit
buying does not contain a digit
U.K. does not contain a digit
startup does not contain a digit
for does not contain a digit
$ does not contain a digit
1 contains a digit
billion does not contain a digit
. does not contain a digit


In [25]:
# Check each token and report whether it is alphanumeric.
# str.isalnum() is True only when every character is a letter or a digit,
# which is why "U.K." fails the check below.

for token in doc:
    if token.text.isalnum():
        print('%s contains an alphanumeric character' % token)
    else:
        print('%s does not contain an alphanumeric character' % token)

Apple contains an alphanumeric character
is contains an alphanumeric character
looking contains an alphanumeric character
at contains an alphanumeric character
buying contains an alphanumeric character
U.K. does not contain an alphanumeric character
startup contains an alphanumeric character
for contains an alphanumeric character
$ does not contain an alphanumeric character
1 contains an alphanumeric character
billion contains an alphanumeric character
. does not contain an alphanumeric character
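
Note that `str.isdigit()` and `str.isalnum()` test whether *every* character in a string qualifies, so the cells above really report whether a token consists entirely of digits or of alphanumeric characters. A minimal sketch of the difference, using plain Python strings rather than SpaCy tokens; the helper names are hypothetical and `"4x4"` is a made-up sample:

```python
def contains_digit(text):
    """True if at least one character is a digit."""
    return any(ch.isdigit() for ch in text)


def contains_alnum(text):
    """True if at least one character is a letter or a digit."""
    return any(ch.isalnum() for ch in text)


for text in ["1", "U.K.", "$", "4x4"]:
    # Columns: text, all digits?, any digit?, all alphanumeric?, any alphanumeric?
    print(text, text.isdigit(), contains_digit(text),
          text.isalnum(), contains_alnum(text))
```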


In [26]:
# Display each token's text alongside its base form (lemma)

for token in doc:
    print(token, token.lemma_)

Apple Apple
is be
looking look
at at
buying buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion
. .


### Syntactic Dependencies
***

In addition to the part-of-speech tags, we can also predict how the words are related, for example whether a word is the subject of the sentence or an object.

In [27]:
for token in doc:
    print(token.text, token.dep_)
   

Apple nsubj
is aux
looking ROOT
at prep
buying pcomp
U.K. compound
startup dobj
for prep
$ quantmod
1 compound
billion pobj
. punct


SpaCy also comes with a built-in dependency visualizer, displaCy, that shows how the words are related. To use it, we simply import displacy and render our dependencies:

In [28]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True)

### Named Entities
***

Named entities are "real-world objects" that are assigned a name – for example, a person, an organization or a country. A list of all of the named entities that SpaCy can recognise can be found here - https://spacy.io/api/annotation#named-entities


Named entities are available as the "ents" property of a Doc.

The example below prints each entity's text, its start and end character offsets within the document, and the label that the model predicted for it.

In [29]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
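
`start_char` and `end_char` are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers each entity. The sketch below hard-codes the sentence and the offsets printed above instead of calling SpaCy:

```python
text = "Apple is looking at buying U.K. startup for $1 billion."

# (start_char, end_char, label) triples copied from the output above
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slice the original text with each pair of offsets.
spans = [(text[start:end], label) for start, end, label in entities]

for span_text, label in spans:
    print(span_text, label)
```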


We can also use displaCy to visualize the named entities by rendering the Doc with style set to "ent".

In [30]:
displacy.render(doc, style='ent', jupyter=True)
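
Outside a notebook, `displacy.render` can return the HTML markup as a string instead of displaying it. The sketch below also relies on displaCy's manual mode (`manual=True`), which renders pre-computed annotations from a dictionary, so it reuses the entity offsets printed earlier and needs no loaded model. The output file name in the comment is hypothetical.

```python
from spacy import displacy

# Pre-computed entities in displaCy's manual format.
example = {
    "text": "Apple is looking at buying U.K. startup for $1 billion.",
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 27, "end": 31, "label": "GPE"},
        {"start": 44, "end": 54, "label": "MONEY"},
    ],
    "title": None,
}

# jupyter=False makes render return the markup rather than display it.
html = displacy.render(example, style="ent", manual=True, jupyter=False)

# The markup could then be saved to a file, for example:
# with open("entities.html", "w", encoding="utf-8") as f:
#     f.write(html)
print(html[:80])
```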

## Challenges:

1. Extract and display all people's names from the Challenge1.txt file, and display the total number of names found.

In [45]:
import requests

response = requests.get("https://ai-camp-content.s3.amazonaws.com/Challenge1.txt")
data = response.text

# Pass the file contents to the nlp object to create a doc
new_doc = nlp(data)


# Create a list to store people's names
names = []


# Loop through the entities in the doc and look for the PERSON label
for ent in new_doc.ents:
    if ent.label_ == 'PERSON':
        # Add the person's name to the list
        names.append(ent.text)


# Print all of the names found and the total number
print(names)
print(len(names))

['Ed Sheeran', 'Taylor Swift', 'Chance the Rapper', 'Justin Bieber', 'Stormzy', 'Travis Scott', 'Eminem', 'Camila Cabello', 'Ed']
9


2. Read in the Challenge2.txt file and find the total number of people's names, organisations and dates within the document.

In [49]:
# Read in the file and create a doc from its contents
response = requests.get("https://ai-camp-content.s3.amazonaws.com/Challenge2.txt")
data = response.text
doc1 = nlp(data)

name = []
organisation = []
date = []

# Loop over the entities of doc1 (the Challenge2 document)
for ent in doc1.ents:
    # label_ (with the trailing underscore) is the string label;
    # label without it is an integer ID
    if ent.label_ == 'PERSON':
        name.append(ent.text)
    elif ent.label_ == 'ORG':
        organisation.append(ent.text)
    elif ent.label_ == 'DATE':
        date.append(ent.text)

total = len(name) + len(organisation) + len(date)

print(total)
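
One way to tally entities per label is `collections.Counter`. The sketch below works on plain `(text, label)` pairs so that it runs without a model; the helper name `count_labels` is hypothetical, and the sample pairs are the entities from the example sentence earlier in this notebook, not the contents of Challenge2.txt.

```python
from collections import Counter


def count_labels(pairs, wanted=("PERSON", "ORG", "DATE")):
    """Count how many (text, label) pairs carry each wanted label."""
    counts = Counter(label for _text, label in pairs if label in wanted)
    # Report zero for wanted labels that never occur.
    return {label: counts[label] for label in wanted}


sample = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
counts = count_labels(sample)
print(counts, sum(counts.values()))

# With a processed document this could be called as, for example:
# count_labels((ent.text, ent.label_) for ent in doc1.ents)
```

Keeping one count per label makes it easy to report both the individual figures and their sum.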


3. Looking at the SpaCy documentation, try to discover more of SpaCy's capabilities and think of a unique use case for it!

4. Fetch the `Challenge1.txt` and `Challenge2.txt` files without using the `requests.get` function. Either fetch the files via the command line, or manually upload them.
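
For challenge 4, once a file has been uploaded or downloaded next to the notebook, it can be read with the standard library. A minimal sketch, assuming UTF-8 text; the helper name `read_text_file` is hypothetical, and the demonstration writes a small temporary file rather than using the real challenge files:

```python
import tempfile
from pathlib import Path


def read_text_file(path):
    """Return the contents of a local text file (assumed to be UTF-8)."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration with a throwaway file standing in for an uploaded one.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Ed Sheeran and Taylor Swift", encoding="utf-8")
    data = read_text_file(sample_path)

print(data)

# For the challenge, something like:
# data = read_text_file("Challenge1.txt")
# new_doc = nlp(data)
```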