In [1]:
!pip install -U spacy-lookups-data

Collecting spacy-lookups-data
  Downloading https://files.pythonhosted.org/packages/3c/f1/be61b032e02a06a221e14f906dc251de90ac459dc2739f0c5225844ecb08/spacy_lookups_data-0.2.0.tar.gz (29.2MB)
Building wheels for collected packages: spacy-lookups-data
  Building wheel for spacy-lookups-data (setup.py): started
  Building wheel for spacy-lookups-data (setup.py): finished with status 'done'
  Stored in directory: C:\Users\Ashish\AppData\Local\pip\Cache\wheels\79\a4\b8\6085d282396938b29675292697e72871b145990d0079ceadc1
Successfully built spacy-lookups-data
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-0.2.0


- Broadly, there are two types of models:
    1. Statistical models
    2. Rule-based models
    
- Statistical models have already been trained on a lot of data; we only have to load them and apply them to our text.
- But sometimes we have to deal with data that a statistical model cannot handle well. That is where rule-based models come into the picture.
- In a rule-based model, we write our own rules to analyze the data.
- In NLP, rule-based matching can be done with regular expressions or with spaCy's rule-based Matcher.
- spaCy's rule-based Matcher is more flexible. It lets you find the words and phrases you are looking for, and it also gives you access to the tokens within the document and their relationships.

- There is also token-based matching, in which the document is split into tokens and the matching rules are applied to those tokens.
- This means that if a sequence of tokens matches a rule, we can extract it.

In [3]:
import spacy

In [4]:
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [5]:
nlp = spacy.load("en_core_web_sm")

In [6]:
doc = nlp("Hello World!")

In [7]:
doc

Hello World!

In [8]:
for token in doc:
    print(token)

Hello
World
!


### Perform Token Based Matching

### Regular Expression

- In token-based matching, we try to find an exact string. But in most cases, we need to find strings of a similar type, not one exact string.
- Eg: How many times does YouTube occur in a doc? Or how many times does Ashish Mehta occur in a document?
- For the above examples, we can hard-code the string, i.e. use token-based matching.
- But suppose we are looking for all the three-character words in a document [dog, mat, bat, sat]. These words can be anything.
- To extract these three-character words, we have to define a generalized rule that looks for any 3-character word in a document and extracts it.
- This can be done with a regular expression.
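As a minimal sketch of such a generalized rule (the sample sentence is made up for illustration): \b marks a word boundary and \w{3} asks for exactly three word characters, so only whole three-character words are returned.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "The dog sat on the mat near a bat"

# \b = word boundary, \w{3} = exactly three word characters
three_char_words = re.findall(r"\b\w{3}\b", sample)
print(three_char_words)
```

Note that \w also matches digits and the underscore, so a token such as 123 would count as a three-character word here.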

#### Meta Characters in Regular Expressions:

- []    - Character set. Matches any one character contained between the brackets
- [^ ]  - Negated set. Matches any one character that is not contained between the brackets
- .     - Matches any single character except a line break
- *     - Matches zero or more repetitions of the preceding symbol
- +     - Matches one or more repetitions of the preceding symbol
- ?     - Makes the preceding symbol optional
- {n,m} - Braces. Matches at least n but not more than m repetitions of the preceding symbol
- (xyz) - Character group. Matches the characters xyz in that exact order
- |     - Alternation. Matches either the characters before or the characters after the symbol
- \     - Escapes the next character. This allows you to match reserved characters [] () {} . * + ? ^ $ \ |
- ^     - Matches the beginning of the input
- $     - Matches the end of the input

#### Identifiers and quantifiers are essential in regular expressions

- Identifiers (such as \d, \w, \s) say what kind of character to match; quantifiers (such as +, *, {n}) say how many times. Complete lists of both are easy to find online.
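To make the identifiers and quantifiers concrete, here is a small sketch using Python's re module. The sample strings are invented for this illustration and are not part of the notebook's data.

```python
import re

# Hypothetical sample string, used only for this illustration
sample = "Order 66 shipped to bay-7 on 2020-01-15"

digits = re.findall(r"\d+", sample)                  # \d = digit, + = one or more
words = re.findall(r"[A-Za-z]+", sample)             # character set with ranges
date = re.search(r"\d{4}-\d{2}-\d{2}", sample)       # {n} = exactly n repetitions
optional = re.findall(r"colou?r", "color colour")    # ? = preceding symbol is optional
either = re.findall(r"cat|dog", "cat dog bird")      # | = alternation

print(digits, words, date.group(), optional, either)
```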

In [20]:
text = "My name is Ashish. My number is 1562. Ohh, thats a wrong number! My correct Number is 9900748547"

In [21]:
# import regular expression
import re

In [22]:
# find the first 10-digit number in the text (re.search returns only the first match)
re.search(r"\d{10}", text)

# \d means digit
# {10} is a quantifier: exactly 10 repetitions of the preceding \d

<re.Match object; span=(86, 96), match='9900748547'>

In [23]:
# find the first run of 4 digits in the text
re.search(r"\d{4}", text)

<re.Match object; span=(32, 36), match='1562'>

In [24]:
# find all the numbers in the text that have at least 3 and at most 10 digits

re.findall(r"\d{3,10}", text)

['1562', '9900748547']

- To find more than one result, we make use of **findall**
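The difference between the calls can be sketched side by side. re.finditer is not used elsewhere in this notebook; it is shown here because it returns match objects, which also carry the position of each hit.

```python
import re

text = "My name is Ashish. My number is 1562. Ohh, thats a wrong number! My correct Number is 9900748547"

first = re.search(r"\d+", text)            # first match only (None if nothing matches)
all_numbers = re.findall(r"\d+", text)     # every match, as a list of strings
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(first.group())
print(all_numbers)
print(positions)
```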

In [25]:
# Find runs of exactly 4 word characters

re.findall(r"\w{4}", text) # longer words are truncated (cut short) to their first 4 characters, e.g. 'Ashi'

['name',
 'Ashi',
 'numb',
 '1562',
 'that',
 'wron',
 'numb',
 'corr',
 'Numb',
 '9900',
 '7485']

In [26]:
# To avoid truncation, add a comma: {4,} means 4 or more characters

re.findall(r"\w{4,}", text)

['name',
 'Ashish',
 'number',
 '1562',
 'thats',
 'wrong',
 'number',
 'correct',
 'Number',
 '9900748547']

### WildCard Text

- We want to find a word that starts with letter n. How do we do that?

In [33]:
re.findall(r"n.....", text)

['name i', 'number', 'ng num']

- Be careful with the number of dots in the pattern: each dot matches exactly one character (spaces included), so every result is "n" plus exactly that many characters, whether or not that lines up with a word.
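If the goal is whole words that start with n, one possible alternative (a sketch, not the only way) is to anchor the pattern at a word boundary with \b and let \w* consume the rest of the word:

```python
import re

text = "My name is Ashish. My number is 1562. Ohh, thats a wrong number! My correct Number is 9900748547"

# \b = word boundary, n = literal first letter, \w* = the rest of the word
n_words = re.findall(r"\bn\w*", text)

# re.IGNORECASE also picks up the capitalised "Number"
n_words_any_case = re.findall(r"\bn\w*", text, flags=re.IGNORECASE)

print(n_words)
print(n_words_any_case)
```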

In [34]:
re.findall(r"Ashish", text)

['Ashish']

In [35]:
# Creating a new text

text = "This is a cat but not that. I want a cat and a hat"

In [37]:
# find every three-character sequence with "a" in the middle

re.findall(r".a.", text) # . matches any character, so white space also counts

[' a ', 'cat', 'hat', 'wan', ' a ', 'cat', ' an', ' a ', 'hat']
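The result above mixes real words with fragments such as 'wan' and ' an'. A sketch of a stricter version, assuming we only want complete three-letter words, replaces the dots with \w and adds word boundaries:

```python
import re

text = "This is a cat but not that. I want a cat and a hat"

# \b on both sides forces the three characters to be a complete word
three_letter_a_words = re.findall(r"\b\wa\w\b", text)
print(three_letter_a_words)
```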

In [42]:
# Find a number at the end of the text

text = "Thanks for subscribing <3"

In [44]:
re.findall(r"\d$", text) # This will only work if the digit is at the absolute end of the text.

['3']

- The $ anchors the match to the end of the string, so the pattern only matches when the text ends with a digit.

In [45]:
text = "9 Thanks for subscribing <3"

In [46]:
re.findall(r"^\d", text) # This means, the string starts with a number

['9']

### Exclusion in Regular Expression

- Exclusion means extracting the data by excluding something.

In [47]:
# Lets say we have to extract all the text but without the numbers

text = "9 Thanks for subscribing <3"

- A ^ sign placed first inside [] brackets negates the set: [^\d] matches any character that is not a digit.

In [48]:
re.findall(r"[^\d]", text)

[' ',
 'T',
 'h',
 'a',
 'n',
 'k',
 's',
 ' ',
 'f',
 'o',
 'r',
 ' ',
 's',
 'u',
 'b',
 's',
 'c',
 'r',
 'i',
 'b',
 'i',
 'n',
 'g',
 ' ',
 '<']

In [49]:
# If we want the sentence without the numbers

re.findall(r"[^\d]+", text)

[' Thanks for subscribing <']

In [54]:
# If we want to get all the numbers in the text: \D means "not a digit", so [^\D] (not a non-digit) is equivalent to \d

re.findall(r"[^\D]+", text)

['9', '3']
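Two related points can be checked quickly: the double negation [^\D] behaves like \d, and if the aim is a cleaned string rather than a list of pieces, re.sub can delete the digits directly. This is a small sketch on the same text.

```python
import re

text = "9 Thanks for subscribing <3"

digits_double_negation = re.findall(r"[^\D]+", text)
digits_plain = re.findall(r"\d+", text)

# Replace every digit with an empty string, then trim the leftover space
without_digits = re.sub(r"\d", "", text).strip()

print(digits_double_negation == digits_plain)
print(without_digits)
```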

In [55]:
# If we want to extract all words with - sign

text = "Thanks for Ashish-Mehta. I am The-Best"

In [58]:
re.findall(r"[\w]+-+[\w]+", text)

['Ashish-Mehta', 'The-Best']

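- There are 3 + signs. The first [\w]+ matches one or more word characters before the hyphen, -+ matches one or more hyphens, and the last [\w]+ matches one or more word characters after the hyphen, so the whole word on each side is captured.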
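The pattern above pairs exactly two parts, so a word with several hyphens is split. One possible variant, sketched here on a made-up sentence, repeats the "hyphen plus word" part with a non-capturing group:

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "Thanks for Ashish-Mehta. This is a state-of-the-art demo"

pairwise = re.findall(r"[\w]+-+[\w]+", sample)     # original pattern
chained = re.findall(r"\w+(?:-\w+)+", sample)      # one or more "-word" parts

print(pairwise)
print(chained)
```

The group is written as (?:...) because re.findall returns only the group contents when a pattern contains a capturing group.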

### Regular Expression in Spacy

In [66]:
text = "Google announced a new pixel at Google I/O. Google I/O is a great place to get all updates from Google."

In [87]:
# Find every occurrence of "Google I/O" in the text.

pattern = [{"TEXT":"Google"}, {"TEXT":"I"}, {"TEXT":"/"}, {"TEXT":"O"},{"IS_PUNCT":True, "OP":"?"}]

# "OP":"?" makes the punctuation token optional, so "Google I/O" matches with or without trailing punctuation

In [95]:
# on_match callback: the Matcher calls it once for every match it finds

def callback(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    entity = doc[start:end]
    print(entity.text)
    
# matcher is the Matcher instance and doc is the document it was run on
# matches is the list of all (match_id, start, end) tuples; i is the index of the current match in that list
# match_id is the hash of the rule name ("Google" below)
# start and end are token indices, so doc[start:end] is the matched span

In [89]:
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this becomes matcher.add("Google", [pattern], on_match=callback)
matcher.add("Google", callback, pattern)

In [90]:
doc = nlp(text)

In [91]:
matcher(doc)

Google I/O
Google I/O .
Google I/O


[(11578853341595296054, 6, 10),
 (11578853341595296054, 6, 11),
 (11578853341595296054, 11, 15)]
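Tokens 6 to 10 are reported twice because the optional punctuation token lets the pattern match both with and without the trailing full stop. If only the longest match per position is wanted, the tuples can be filtered afterwards. The helper below, keep_longest, is a hypothetical name written for this sketch in plain Python; spaCy also provides spacy.util.filter_spans, which does a similar job on Span objects.

```python
def keep_longest(matches):
    """Keep the longest match wherever (start, end) token ranges overlap."""
    # Longest first; ties are resolved in favour of the earlier start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        if not any(i in used for i in range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])


# The match tuples printed above
matches = [
    (11578853341595296054, 6, 10),
    (11578853341595296054, 6, 11),
    (11578853341595296054, 11, 15),
]
filtered = keep_longest(matches)
print(filtered)
```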

### Linguistic Annotations:

##### Let's say you are analyzing your comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following "Facebook is" or "Facebook was". This is a very rudimentary solution, but it will be fast, and it is a great way to get an idea of what's in your data. Your pattern could look like this.

**pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]**

- This translates to a token whose lowercase form is "facebook" (like Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for example is, was or 's), followed by zero or more adverbs, followed by an adjective.

- Linguistic annotations can be used for sentiment analysis.

- Annotations are available on: https://spacy.io/api/annotations 

In [92]:
matcher = Matcher(nlp.vocab)

In [93]:
matched_sents = []

In [94]:
pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]

In [96]:
def callback(matcher, doc, i, matches):
    match_id, start, end = matches[i] # Each match is a tuple of match_id, start token index and end token index
    span = doc[start:end] # The matched span (not the entire document)
    sent = span.sent # The sentence that contains the match
    
    match_ents = [{                                     # Our own "entity" so that displacy highlights the match
        "start": span.start_char - sent.start_char,     # Character offsets relative to the start of the sentence
        "end": span.end_char - sent.start_char,
        "label": "MATCH"
    }]
    
    matched_sents.append({"text": sent.text, "ents": match_ents})

In [97]:
matcher.add("fb", callback, pattern)

In [98]:
doc = nlp("I'd say that facebook is evil. - Facebook is pretty cool, right?")

In [99]:
matches = matcher(doc)

In [100]:
matches

[(8017838677478259815, 4, 7), (8017838677478259815, 9, 13)]

In [101]:
matched_sents

[{'text': "I'd say that facebook is evil.",
  'ents': [{'start': 13, 'end': 29, 'label': 'MATCH'}]},
 {'text': '- Facebook is pretty cool, right?',
  'ents': [{'start': 2, 'end': 25, 'label': 'MATCH'}]}]
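The offsets are relative to each sentence, not to the whole document, which is why the second match starts at 2. A quick sanity check in plain Python is to slice each sentence text with its own offsets (the list below simply copies the output above into a new variable):

```python
# Copy of the output above, kept in a separate variable for this check
matched_sents_example = [
    {"text": "I'd say that facebook is evil.",
     "ents": [{"start": 13, "end": 29, "label": "MATCH"}]},
    {"text": "- Facebook is pretty cool, right?",
     "ents": [{"start": 2, "end": 25, "label": "MATCH"}]},
]

# Slicing the sentence with its offsets should give back the matched phrase
highlighted = [
    sent["text"][ent["start"]:ent["end"]]
    for sent in matched_sents_example
    for ent in sent["ents"]
]
print(highlighted)
```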

In [102]:
displacy.render(matched_sents, style = "ent", manual = True)

### Extracting Phone Number

**Phone numbers can have many different formats, and matching them is often tricky. During tokenization, spaCy leaves sequences of digits intact and only splits on whitespace and punctuation. This means that your match patterns have to look for number sequences of a certain length, surrounded by specific punctuation, depending on the national conventions.**

**Suppose you want to match numbers like these: (123) 4567 8901 or (123) 4567-8901**


 **pattern = [{"ORTH":"("}, {"SHAPE": "ddd"}, {"ORTH":")"}, {"SHAPE":"dddd"}, {"ORTH":"-", "OP":"?"}, {"SHAPE":"dddd"}]**

In [104]:
pattern = [{"ORTH":"("}, {"SHAPE": "ddd"}, {"ORTH":")"}, {"SHAPE":"dddd"}, {"ORTH":"-", "OP":"?"}, {"SHAPE":"dddd"}]

In [105]:
matcher = Matcher(nlp.vocab)

In [106]:
matcher.add("PhoneNumber", None, pattern)  # None = no callback; in spaCy v3: matcher.add("PhoneNumber", [pattern])

In [107]:
doc = nlp("Call me at (123) 4857 6254 or call me at (154) 5874-5625")

In [108]:
# tokenizing the sentence
print([t.text for t in doc])

['Call', 'me', 'at', '(', '123', ')', '4857', '6254', 'or', 'call', 'me', 'at', '(', '154', ')', '5874', '-', '5625']


In [109]:
matches = matcher(doc)

In [110]:
matches

[(7978097794922043545, 3, 8), (7978097794922043545, 12, 18)]

In [111]:
# printing the actual numbers in the sentence

for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

(123) 4857 6254
(154) 5874-5625
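As a cross-check, roughly the same rule can be written as a plain regular expression. This is only an approximation of the token pattern: the regex requires exactly one space after the closing bracket, whereas the Matcher works on tokens and does not look at the whitespace between them.

```python
import re

text = "Call me at (123) 4857 6254 or call me at (154) 5874-5625"

# \( and \) are escaped brackets; [ -] allows a space or a hyphen between the two groups
phone_re = re.compile(r"\(\d{3}\) \d{4}[ -]\d{4}")
numbers = phone_re.findall(text)
print(numbers)
```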


### Extracting email addresses from a text

In [112]:
# The hyphen is placed last inside the set so that it cannot be read as a range
pattern = [{"TEXT": {"REGEX": "[a-zA-Z0-9_.-]+@[a-zA-Z0-9.]+"}}]

In [113]:
doc = nlp("please msg me at xyz@gmail.com or at ags_123@hotmail.com")

In [114]:
matcher = Matcher(nlp.vocab)

In [116]:
matcher.add("Email", None, pattern) # "Email" is the rule name, None means no callback, pattern is the token pattern defined above

In [117]:
matches = matcher(doc)

In [118]:
matches

[(11010771136823990775, 4, 5), (11010771136823990775, 7, 8)]

In [119]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

xyz@gmail.com
ags_123@hotmail.com
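Before embedding a regular expression in a token pattern, it can help to try it on its own. The sketch below splits the sentence on whitespace as a rough stand-in for tokenization (an assumption made for this example; spaCy's tokenizer is more sophisticated) and keeps the pieces that match completely.

```python
import re

text = "please msg me at xyz@gmail.com or at ags_123@hotmail.com"

# Same expression as in the token pattern, with the hyphen placed last in the set
email_re = re.compile(r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9.]+")

# fullmatch requires the whole piece to match, not just a part of it
emails = [piece for piece in text.split() if email_re.fullmatch(piece)]
print(emails)
```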


### Hashtags and Emoji Detection and Extraction on social media

**Social media posts, especially tweets, can be difficult to work with. They're very short and often contain various emojis and hashtags. By looking only at the plain text, we lose a lot of valuable semantic information.**

- By looking at hashtags and emojis, we can get a signal about the sentiment of a post.
- Eg: If a post contains a lot of laughing emojis, its sentiment is probably happy, whereas a lot of crying emojis suggests that the sentiment of the post is sad.

In [121]:
!pip install emoji



In [122]:
pattern = [{"TEXT":"#"}, {"IS_ASCII": True}]

In [123]:
doc = nlp("#Sadak ekdum kadak. I love myself. #Happylife")

In [124]:
matcher = Matcher(nlp.vocab)

In [125]:
matcher.add("Hashtag", None, pattern)

In [126]:
matches = matcher(doc)

In [127]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

#Sadak
#Happylife
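For a quick look at which hashtags occur and how often, a plain regular expression plus collections.Counter is one lightweight option. This is a sketch rather than a replacement for the Matcher; note that \w is Unicode-aware in Python 3, so unlike IS_ASCII it also accepts non-ASCII hashtags.

```python
import re
from collections import Counter

text = "#Sadak ekdum kadak. I love myself. #Happylife"

# "#" followed by one or more word characters
hashtags = re.findall(r"#\w+", text)

# Lower-case the tags so that #Happylife and #happylife are counted together
tag_counts = Counter(tag.lower() for tag in hashtags)

print(hashtags)
print(tag_counts)
```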


### Efficient Phrase Matching

**If you want to match a large terminology list, you can use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.**

In [128]:
from spacy.matcher import PhraseMatcher

In [129]:
matcher = PhraseMatcher(nlp.vocab)

In [130]:
terms = ["BARAK OBAMA", "ANGELA MERKEL", "WASHINGTON D.C"]

- We want to find these terms in our documents.

In [132]:
pattern = [nlp.make_doc(text) for text in terms]

# nlp.make_doc only runs the tokenizer, so each term is turned into a Doc pattern quickly

In [133]:
pattern

[BARAK OBAMA, ANGELA MERKEL, WASHINGTON D.C]

In [135]:
matcher.add("term", None, *pattern)  # spaCy v2 signature; in spaCy v3: matcher.add("term", pattern)

In [142]:
doc = nlp("German Chancellor ANGELA MERKEL and US President BARAK OBAMA had a mmeting about world economy at WASHINGTON D.C")

In [143]:
doc

German Chancellor ANGELA MERKEL and US President BARAK OBAMA had a mmeting about world economy at WASHINGTON D.C

In [144]:
matches = matcher(doc)

In [145]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

ANGELA MERKEL
BARAK OBAMA
WASHINGTON D.C


In [146]:
matches

[(4519742297340331040, 2, 4),
 (4519742297340331040, 7, 9),
 (4519742297340331040, 16, 18)]

### Custom Rule Based Entity Recognition

The Entity Ruler is a pipeline component that lets you add named entities based on pattern dictionaries. It makes it easy to combine rule-based and statistical named entity recognition in a single pipeline.

#### Entity Patterns

**Entity patterns are dictionaries with two keys: "label", specifying the label to assign to the entity if the pattern is matched, and "pattern", the match pattern. The entity ruler accepts two types of patterns:**
        
    1. PHRASE PATTERN: {"label":"ORG", "pattern": "KGP talkie"}
    2. TOKEN PATTERN: {"label":"ORG", "pattern": [{"LOWER":"san"},{"LOWER":"fransisco"}]}

**The Entity Ruler is a pipeline component that's typically added via nlp.add_pipe. When the nlp object is called on a text, it will find matches in the doc and add them as entities to doc.ents, using the specified pattern label as the entity label.**

In [3]:
import spacy
from spacy.pipeline import EntityRuler

In [4]:
nlp = spacy.load("en_core_web_sm")

In [5]:
ruler = EntityRuler(nlp)

In [6]:
# Create a pattern for KGP talkie

pattern = [{"label":"ORG", "pattern": "KGP talkie"},
           {"label":"GPE", "pattern": [{"LOWER":"san"},{"LOWER":"fransisco"}]}]

In [7]:
pattern

[{'label': 'ORG', 'pattern': 'KGP talkie'},
 {'label': 'GPE', 'pattern': [{'LOWER': 'san'}, {'LOWER': 'fransisco'}]}]

In [8]:
ruler.add_patterns(pattern)

In [9]:
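# Added last, the ruler runs after the statistical "ner" component. Entities the model has already
# predicted are kept, and patterns that overlap them are skipped (see "KGP" in the output below).
# To give the patterns priority, use nlp.add_pipe(ruler, before="ner") or EntityRuler(nlp, overwrite_ents=True).
# In spaCy v3, create and add the component with ruler = nlp.add_pipe("entity_ruler") instead.
nlp.add_pipe(ruler)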

In [10]:
doc = nlp("KGP talkie is opening its first big office in San Fransisco")

In [11]:
doc

KGP talkie is opening its first big office in San Fransisco

In [12]:
for ent in doc.ents:
    print(ent.text, ent.label_)

KGP ORG
first ORDINAL
San Fransisco GPE
