# Lab: Installing and working with spaCy


### Overview:
#### We will be running spaCy with 2 languages: English and German

### Depends On:
#### None

### Run time:
#### 30 mins

### STEP 1: Installing spaCy

Install spaCy with pip:

```bash
    $   pip install spacy
```
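
After the install finishes, it can be worth confirming that the package landed in the same interpreter the notebook is using. The following is a minimal sketch, not part of the original lab: the helper name `installed_version` is hypothetical, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical usage: report whether spaCy is visible to this interpreter.
version = installed_version("spacy")
if version is None:
    print("spaCy is not installed for this interpreter")
else:
    print("spaCy version:", version)
```

If the check reports that spaCy is missing even though `pip install` succeeded, the `pip` on the shell path probably belongs to a different Python environment than the notebook kernel.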

### STEP 2: Installing models and processing text

In this lab we work mostly with `English` and a little with `German`, so we need to download and install the corresponding models as follows:

```bash
   $  python -m spacy download en_core_web_sm
   $  python -m spacy download de_core_news_sm
```

In [2]:
! python -m spacy download en_core_web_sm
! python -m spacy download de_core_news_sm


    Linking successful
    /home/ubuntu/apps/anaconda/lib/python3.7/site-packages/en_core_web_sm
    -->
    /home/ubuntu/apps/anaconda/lib/python3.7/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')

Collecting de_core_news_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz#egg=de_core_news_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz (38.2MB)
    100% |████████████████████████████████| 38.2MB 43.0MB/s eta 0:00:01
Installing collected packages: de-core-news-sm
  Running setup.py install for de-core-news-sm ... done
Successfully installed de-core-news-sm-2.0.0

    Linking successful
    /home/ubuntu/apps/anaconda/lib/python3.7/site-packages/de_core_news_sm
    -->
    /home/ubuntu/apps/anaconda/lib/python3.7/site-packages/sp
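
The output above shows that each model is installed as an ordinary package under `site-packages`, so its presence can be checked from Python before trying to load it. The sketch below is an illustration under that assumption; the helper names `missing_models` and `download_command` are hypothetical. It only prints the download commands instead of running them, and it builds them from `sys.executable` so that they target the interpreter running the notebook.

```python
import importlib.util
import sys

# The two models used in this lab.
MODELS = ["en_core_web_sm", "de_core_news_sm"]


def missing_models(names):
    """Return the model packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def download_command(name):
    """Build the download command for the interpreter running this code."""
    return [sys.executable, "-m", "spacy", "download", name]


for name in missing_models(MODELS):
    # Printed rather than executed, so nothing is downloaded by accident.
    print(" ".join(download_command(name)))
```

If nothing is printed, both models are already importable and `spacy.load` should find them.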

### STEP 3: Verifying that spaCy works

In [3]:
import spacy

nlp_en = spacy.load("en_core_web_sm")
doc_en = nlp_en(u"Elephant Scale works on big data.")
print([word.text for word in doc_en])  # printing every word in the text

# To see the process in other languages
nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de(u"Elephant Scale arbeitet mit Big Data.")
print([word.text for word in doc_de])  # printing every word in the text

['Elephant', 'Scale', 'works', 'on', 'big', 'data', '.']
['Elephant', 'Scale', 'arbeitet', 'mit', 'Big', 'Data', '.']


### STEP 4: Deriving tokens, noun chunks and sentences

In [4]:
doc = nlp_en(u"Elephant Scale works on big data. The most important part of the company is 'human resources' department")

# Tokens
print('Tokens: ', [word.text for word in doc])

# Noun chunks
noun_chunks = list(doc.noun_chunks)
print('Noun chunks: ', noun_chunks)

# Sentences
sentences = list(doc.sents)
print('Sentences: ', sentences)

Tokens:  ['Elephant', 'Scale', 'works', 'on', 'big', 'data', '.', 'The', 'most', 'important', 'part', 'of', 'the', 'company', 'is', "'", 'human', 'resources', "'", 'department']
Noun chunks:  [Elephant Scale, big data, The most important part, the company, human resources' department]
Sentences:  [Elephant Scale works on big data., The most important part of the company is 'human resources' department]


In [10]:
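# Note: `doc` is whichever document was assigned last. The output below was
# recorded after the STEP 5 cell had run, so it lists the noun chunks of
# "Elephant Scale has a revenue of $100 million".
for chunk in doc.noun_chunks:
    print(chunk)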

Elephant Scale
a revenue


### STEP 5: Getting part-of-speech

In [5]:
doc = nlp_en(u"Elephant Scale has a revenue of $100 million")
es = doc[0]
print(es)
print("Coarse-grained POS tag:", es.pos_)  # universal POS tag, e.g. PROPN
print("Fine-grained POS tag:", es.tag_)  # detailed treebank tag, e.g. NNP
print("Word shape:", es.shape_)
print("Alphanumeric characters?", es.is_alpha)
print("Punctuation mark?", es.is_punct)

mil = doc[8]
print(mil)
print("Is million a digit?", mil.is_digit)
print("Like a number?", mil.like_num)
print("Like an email address?", mil.like_email)

Elephant
Coarse-grained POS tag: PROPN
Fine-grained POS tag: NNP
Word shape: Xxxxx
Alphanumeric characters? True
Punctuation mark? False
million
Is million a digit? False
Like a number? True
Like an email address? False
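
The word shape `Xxxxx` for `Elephant` is shorter than the word itself, which can be surprising. The sketch below approximates how such a shape can be derived: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept, and a run of the same shape character is cut off after four repeats. It is a plain-Python illustration written for this lab, not spaCy's own code, and the function name `word_shape` is chosen here for convenience.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape for `text`.

    X = uppercase letter, x = lowercase letter, d = digit; any other
    character is kept as-is. Runs of the same shape character are
    truncated after four repeats.
    """
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Elephant", "million", "$100"]:
    print(word, "->", word_shape(word))
```

Because of the truncation, words of different lengths can share a shape, which makes the shape a compact feature for spotting patterns such as capitalized names or numbers.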


In [18]:
a = "Tesla builds fast electric cars in the Fremont factory.  It is founded by Elon Musk"

doc = nlp_en(a)

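print("---- nouns----")
for noun in doc.noun_chunks:
    print(noun)
    # noun.end is the index of the first token after the chunk
    print("   end token index : ", noun.end)

---- nouns----
Tesla
   end token index :  1
fast electric cars
   end token index :  5
the Fremont factory
   end token index :  9
It
   end token index :  12
Elon Musk
   end token index :  17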


In [22]:
print(" --- entities ---")
for ent in doc.ents:
    print(ent)
    print(type(ent))

 --- entities ---
Fremont
<class 'spacy.tokens.span.Span'>
Elon Musk
<class 'spacy.tokens.span.Span'>
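
The cell above shows that each entity is a `Span`, but it does not show what kind of entity was found. A `Span` also carries a label and character offsets, so a small helper can collect them. This is a minimal sketch; the helper names `entity_rows` and `print_entities` are hypothetical, and it assumes `doc` comes from a pipeline with a named entity recognizer, such as `nlp_en` in this lab.

```python
def entity_rows(doc):
    """Return (text, label, start_char, end_char) for each entity in `doc`."""
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]


def print_entities(doc):
    """Print one aligned line per entity."""
    for text, label, start, end in entity_rows(doc):
        print(f"{text:<20} {label:<10} chars {start}-{end}")
```

Calling `print_entities(doc)` with the `doc` from the Tesla cell lists each entity next to its label. The exact labels depend on the model version, so no output is shown here.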


In [23]:
from spacy import displacy

In [31]:
displacy.serve(doc, style="ent", port=9000)


    Serving on port 9000...
    Using the 'ent' visualizer



172.17.0.1 - - [03/Oct/2019 21:18:59] "GET / HTTP/1.1" 200 1257
172.17.0.1 - - [03/Oct/2019 21:18:59] "GET /favicon.ico HTTP/1.1" 200 1257



    Shutting down server on port 9000.



In [29]:

displacy.render(doc, style="ent", jupyter=True)
displacy.render(doc, style="dep", jupyter=True)