# Lab: Installing and working with spaCy


### Overview:
#### We will be running spaCy with 2 languages: English and German

### Depends On:
#### None

### Run time:
#### 30 mins

### STEP 1: Installing spaCy

Install spaCy with pip:

```bash
$ pip install spacy
```
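As a quick sanity check after installing, the installed version can be read from the package metadata without importing spaCy itself. This is a minimal sketch using only the standard library (Python 3.8 or later); the helper name `installed_version` is made up for this lab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


spacy_version = installed_version("spacy")
if spacy_version is None:
    print("spaCy is not installed; run: pip install spacy")
else:
    print("spaCy version:", spacy_version)
```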

### STEP 2: Installing language models

This lab works mostly with English and a little with German, so download and install the corresponding models as follows:

```bash
$ python -m spacy download en_core_web_sm
$ python -m spacy download de_core_news_sm
```
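Each downloaded model is installed as a regular Python package, so its presence can be checked before calling `spacy.load`. The sketch below uses only the standard library; `missing_models` is a hypothetical helper written for this lab, not part of spaCy.

```python
from importlib.util import find_spec

MODELS = ["en_core_web_sm", "de_core_news_sm"]


def missing_models(names):
    """Return the model package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_models(MODELS)
if missing:
    for name in missing:
        print(f"Missing model, install it with: python -m spacy download {name}")
else:
    print("All models are installed.")
```

If a model is reported as missing, rerun the matching download command from this step before continuing.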

### STEP 3: Verifying that spaCy works

```python
import spacy

# Load the English model and tokenize a sentence
nlp_en = spacy.load("en_core_web_sm")
doc_en = nlp_en("Elephant Scale works on big data.")
print([word.text for word in doc_en])  # print every token in the text

# Run the same check with the German model
nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de("Elephant Scale arbeitet mit Big Data.")
print([word.text for word in doc_de])  # print every token in the text
```

Output:

```
['Elephant', 'Scale', 'works', 'on', 'big', 'data', '.']
['Elephant', 'Scale', 'arbeitet', 'mit', 'Big', 'Data', '.']
```
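Tokenization is rule-based and does not depend on the trained components of a model, so for a simple sentence like the English one above, a blank pipeline should produce the same tokens. The sketch below assumes spaCy is installed; `nlp_blank` and `doc_blank` are names chosen here for illustration.

```python
import spacy

# A blank pipeline contains only the language's tokenizer, no trained components.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Elephant Scale works on big data.")

tokens = [token.text for token in doc_blank]
print(tokens)

# token.i is the token's position in the doc, token.idx its character offset.
for token in doc_blank:
    print(token.i, token.idx, token.text)
```

A blank pipeline is handy when only tokens are needed, but attributes such as part-of-speech tags, noun chunks, and sentence boundaries still require a trained model.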


### STEP 4: Deriving tokens, noun chunks and sentences

```python
doc = nlp_en("Elephant Scale works on big data. The most important part of the company is 'human resources' department")

# Tokens
print("Tokens:", [word.text for word in doc])

# Noun chunks
noun_chunks = list(doc.noun_chunks)
print("Noun chunks:", noun_chunks)

# Sentences
sentences = list(doc.sents)
print("Sentences:", sentences)
```

Output:

```
Tokens: ['Elephant', 'Scale', 'works', 'on', 'big', 'data', '.', 'The', 'most', 'important', 'part', 'of', 'the', 'company', 'is', "'", 'human', 'resources', "'", 'department']
Noun chunks: [Elephant Scale, big data, The most important part, the company, human resources' department]
Sentences: [Elephant Scale works on big data., The most important part of the company is 'human resources' department]
```


### STEP 5: Getting part-of-speech tags and token attributes

```python
doc = nlp_en("Elephant Scale has a revenue of $100 million")

# Inspect the first token, "Elephant"
es = doc[0]
print(es)
print("Coarse-grained POS tag:", es.pos_)  # universal POS tag
print("Fine-grained POS tag:", es.tag_)    # detailed, treebank-style tag
print("Word shape:", es.shape_)
print("Alphabetic characters only?", es.is_alpha)
print("Punctuation mark?", es.is_punct)

# Inspect the last token, "million" (index 8)
mil = doc[8]
print(mil)
print("Is million a digit?", mil.is_digit)
print("Like a number?", mil.like_num)
print("Like an email address?", mil.like_email)
```

Output:

```
Elephant
Coarse-grained POS tag: PROPN
Fine-grained POS tag: NNP
Word shape: Xxxxx
Alphabetic characters only? True
Punctuation mark? False
million
Is million a digit? False
Like a number? True
Like an email address? False
```
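The word shape `Xxxxx` for `Elephant` can look surprising, since the word has eight letters. The shape replaces uppercase letters with `X`, lowercase letters with `x`, and digits with `d`, and truncates runs of the same shape character after length 4. The function below is a pure-Python approximation written for this lab to make that rule concrete; `approx_shape` is a made-up name and not spaCy's API, and it may differ from spaCy on edge cases such as very long strings.

```python
def approx_shape(text):
    """Approximate a spaCy-style word shape for `text` (illustrative only)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how many times the same shape character repeats in a row
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Elephant", "Scale", "million", "$100"]:
    print(word, "->", approx_shape(word))
```

Because runs are capped at four characters, words of different lengths such as `Elephant` and `Scale` end up with the same shape.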
