**SpaCy Introduction to NLP**

In [None]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.7-py3-none-any.whl (17 kB)
Collecting typing-extensions<4.0.0.0,>=3.7.4
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Collecting spacy-legacy<3.1.0,>=3.0.9
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (457 kB)
Collecting thinc<8.1.0,>=8.0.14
  Downloading thinc-8.0.15-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (653 kB)
Collecting langco

In [None]:
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm

Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


spaCy is a free, open-source library for advanced NLP.
- Features:
1. Tokenization - Segmenting text into words, punctuation marks, etc.
2. Part-of-speech (POS) tagging - Assigning word types to tokens, such as verb or noun
3. Dependency parsing - Assigning syntactic dependency labels that describe the relations between individual tokens
4. Lemmatization - Assigning the base forms of words
5. Sentence boundary detection - Finding and segmenting the individual sentences
6. Named entity recognition - Labelling named “real-world” objects, like persons, companies or locations
7. Entity linking - Disambiguating textual entities to unique identifiers in a knowledge base
8. Similarity - Comparing words, text spans and documents to measure how similar they are
9. Text classification - Assigning categories or labels to a whole document or to parts of it
10. Rule-based matching - Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions
11. Training - Updating and improving a statistical model’s predictions
12. Serialization - Saving objects to files or byte strings

1. **Tokenization**

In [None]:
import spacy 

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
help(nlp)

Help on English in module spacy.lang.en object:

class English(spacy.language.Language)
 |  English(vocab: Union[spacy.vocab.Vocab, bool] = True, *, max_length: int = 1000000, meta: Dict[str, Any] = {}, create_tokenizer: Union[Callable[[ForwardRef('Language')], Callable[[str], spacy.tokens.doc.Doc]], NoneType] = None, batch_size: int = 1000, **kwargs) -> None
 |  
 |  A text-processing pipeline. Usually you'll load this once per process,
 |  and pass the instance around your application.
 |  
 |  Defaults (class): Settings, data and factory methods for creating the `nlp`
 |      object and processing pipeline.
 |  lang (str): IETF language code, such as 'en'.
 |  
 |  DOCS: https://spacy.io/api/language
 |  
 |  Method resolution order:
 |      English
 |      spacy.language.Language
 |      builtins.object
 |  
 |  Data and other attributes defined here:
 |  
 |  Defaults = <class 'spacy.lang.en.EnglishDefaults'>
 |      Language data defaults, available via Language.Defaults. Can be


In [None]:
text = "Apple is looking at buying U.K. startup. Government has given permission for acquisition."
doc = nlp(text)
doc

Apple is looking at buying U.K. startup. Government has given permission for acquisition.

In [None]:
for sent in doc.sents:
  print(sent)

Apple is looking at buying U.K. startup.
Government has given permission for acquisition.
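
The cells above print the sentences but never show the individual tokens. The following is a minimal sketch of both steps. It assumes a blank English pipeline (`spacy.blank("en")`) with the rule-based `sentencizer` component standing in for the trained `en_core_web_sm` parser, so that it runs without a model download; the names `nlp_blank`, `sample_doc`, `tokens` and `sentences` are only illustrative, and the rule-based splitter may segment other texts differently from the trained pipeline.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) plus the rule-based
# sentencizer, used here instead of the trained en_core_web_sm pipeline.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

sample_text = (
    "Apple is looking at buying U.K. startup. "
    "Government has given permission for acquisition."
)
sample_doc = nlp_blank(sample_text)

# Tokenization: every item in the Doc is a Token object.
tokens = [token.text for token in sample_doc]
print(tokens)

# Sentence boundaries set by the sentencizer are exposed through doc.sents.
sentences = [sent.text for sent in sample_doc.sents]
for sentence in sentences:
    print(sentence)
```

Note that the abbreviation "U.K." is kept as a single token, so its periods do not end the first sentence.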


**Phrase matching**
- Rule-based matching can be done in two ways:
1. Token-based matching (Matcher)
2. Phrase matching (PhraseMatcher)

In [None]:
from spacy.matcher import Matcher
from spacy.tokens import Span

In [None]:
text = 'Hello, world! hello world'

In [None]:
doc = nlp(text)

In [None]:
for token in doc:
  print(token)

Hello
,
world
!
hello
world


In [None]:
pattern = [{'LOWER':'hello'},{'IS_PUNCT':True,'OP':'?'},{'LOWER':'world'}]

In [None]:
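matcher = Matcher(nlp.vocab)
# spaCy v2 signature (no longer valid in v3): matcher.add('hw', None, pattern)
# spaCy v3 takes a list of patterns; on_match is an optional callback
matcher.add('hw',[pattern],on_match=None)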

In [None]:
matches = matcher(doc)
matches

[(17790654416186116455, 0, 3), (17790654416186116455, 4, 6)]

In [None]:
for match_id,start,end in matches:
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id,string_id,start,end,span.text)

17790654416186116455 hw 0 3 Hello, world
17790654416186116455 hw 4 6 hello world


In [None]:
text

'Hello, world! hello world'
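
The token-based Matcher is demonstrated above, but the second approach from the list, the PhraseMatcher, is not. As a rough sketch, the same text can be searched with a `PhraseMatcher` that compares the lowercase form of each token (`attr="LOWER"`). A blank English pipeline is assumed here so that no trained model is needed, and the rule name `hw_phrase` is made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline is enough, because the PhraseMatcher
# only needs tokenized Doc objects as patterns.
nlp_blank = spacy.blank("en")

# attr="LOWER" compares lowercase token texts, so "Hello" can match "hello".
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
phrase_matcher.add("hw_phrase", [nlp_blank.make_doc("hello world")])

phrase_doc = nlp_blank("Hello, world! hello world")
phrase_spans = [
    (start, end, phrase_doc[start:end].text)
    for _, start, end in phrase_matcher(phrase_doc)
]
print(phrase_spans)
```

Unlike the token pattern above, which allows an optional punctuation token, the phrase pattern is an exact token sequence, so "Hello, world" (with the comma) is not matched.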

**Processing a pipeline in SpaCy**
- The pipelines shipped with the trained models include components such as a tagger, a parser, and an entity recognizer.
- Each pipeline component returns the processed document, which is then passed on to the next component.
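
To make the hand-off between components concrete, here is a small sketch of a custom component added to a pipeline. The component name `length_logger` and the list `seen_lengths` are hypothetical, and a blank English pipeline with the rule-based `sentencizer` is assumed instead of `en_core_web_sm`.

```python
import spacy
from spacy.language import Language

seen_lengths = []  # hypothetical side channel, used only for this demo


@Language.component("length_logger")  # hypothetical component name
def length_logger(doc):
    # A component receives the Doc produced by the previous component...
    seen_lengths.append(len(doc))
    # ...and must return it so the next component can continue.
    return doc


# Assumption: blank pipeline plus sentencizer, to avoid a model download.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")
nlp_blank.add_pipe("length_logger", last=True)

logged_doc = nlp_blank("Apple is looking at buying U.K. startup.")
print(nlp_blank.pipe_names)
print(seen_lengths)
```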

### **Tips:**
- Process texts as a stream using nlp.pipe and buffer them in batches, instead of one by one. This is usually much more efficient.
- Only apply the components you need. Use the disable keyword argument to skip the components you don't need.
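
Both tips can be sketched in a few lines. The example assumes a blank English pipeline whose only component is the rule-based `sentencizer`, which stands in for the heavier components of a trained model; `batch_size=2` is an arbitrary value chosen for illustration.

```python
import spacy

# Assumption: blank pipeline with one cheap component, so the effect of
# disabling a component can be observed without a trained model.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

texts = [
    "net income was $9.4 million compared to the prior year of 2.7$ million",
    "revenue exceeds twelve billion dollars with a loss of $1b",
]

# Stream the texts in batches with every component enabled.
full_docs = list(nlp_blank.pipe(texts, batch_size=2))

# Stream again, skipping the component that is not needed.
light_docs = list(nlp_blank.pipe(texts, batch_size=2, disable=["sentencizer"]))

# Sentence boundaries exist only where the sentencizer actually ran.
full_flags = [doc.has_annotation("SENT_START") for doc in full_docs]
light_flags = [doc.has_annotation("SENT_START") for doc in light_docs]
print(full_flags, light_flags)
```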

In [None]:
import spacy

In [None]:
text = ['net income was $9.4 million compared to the prior year of 2.7$ million',
        'revenue exceeds twelve billion dollars with a loss of $1b']

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
%%timeit
docs = nlp.pipe(text, disable = ['tagger','parser'])

for doc in docs:
  for ent in doc.ents:
    print(ent.text,ent.label_)
  print()



$9.4 million MONEY
the prior year DATE
2.7$ million MONEY

twelve billion dollars MONEY
1b MONEY

In [None]:
%%timeit
docs = nlp.pipe(text)

for doc in docs:
  for ent in doc.ents:
    print(ent.text,ent.label_)
  print()

$9.4 million MONEY
the prior year DATE
2.7$ million MONEY

twelve billion dollars MONEY
1b MONEY

### Hashtags and Emoji detection

In [None]:
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  
matcher = Matcher(nlp.vocab)

pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
neg_emoji = ["😔", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji

In [None]:
pos= [[{'ORTH': emoji}] for emoji in pos_emoji]
neg= [[{'ORTH': emoji}] for emoji in neg_emoji]

pos,neg

([[{'ORTH': '😀'}],
  [{'ORTH': '😃'}],
  [{'ORTH': '😂'}],
  [{'ORTH': '🤣'}],
  [{'ORTH': '😊'}],
  [{'ORTH': '😍'}]],
 [[{'ORTH': '😔'}],
  [{'ORTH': '😠'}],
  [{'ORTH': '😩'}],
  [{'ORTH': '😢'}],
  [{'ORTH': '😭'}],
  [{'ORTH': '😒'}]])

In [None]:
def label_sentiment(matcher,doc,i,matches):
  # on_match callback: adjust doc.sentiment depending on which rule matched
  match_id,start,end = matches[i]
  if doc.vocab.strings[match_id] == 'happy':
    doc.sentiment += .1  # positive emoji raises the score
  elif doc.vocab.strings[match_id] == 'sad':
    doc.sentiment -= .1  # negative emoji lowers the score

In [None]:
matcher.add('happy', pos, on_match = label_sentiment)
matcher.add('sad', neg, on_match = label_sentiment)

In [None]:
matcher.add('HASHTAG',[[{'ORTH':'#'},{'IS_ASCII':True}]])

In [None]:
doc = nlp(' Congratulations😍 You are about to complete your Level 3 😃😃 farewll 😭  #Megha')

In [None]:
matches = matcher(doc)

In [None]:
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)

happy 😍
happy 😃
happy 😃
sad 😭
HASHTAG #Megha


In [None]:
doc.sentiment

0.20000001788139343
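
The hashtag match above spans two tokens, `#` and the word after it. If later steps should treat the hashtag as a single token, one option is to merge the matched spans with `doc.retokenize()`. The sketch below assumes the same `English()` pipeline and `HASHTAG` pattern as above; the variable names are illustrative.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp_en = English()
hashtag_matcher = Matcher(nlp_en.vocab)
hashtag_matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

hashtag_doc = nlp_en("You are about to complete your Level 3 #Megha")
before = [token.text for token in hashtag_doc]

# Collect the matched spans first, then merge them in one retokenize block.
hashtag_spans = [
    hashtag_doc[start:end] for _, start, end in hashtag_matcher(hashtag_doc)
]
with hashtag_doc.retokenize() as retokenizer:
    for span in hashtag_spans:
        retokenizer.merge(span)

after = [token.text for token in hashtag_doc]
print(before)
print(after)
```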