
# spaCy

I’ve listed below the different statistical models in spaCy along with their specifications:

- en_core_web_sm: English multi-task CNN trained on OntoNotes. Size – 11 MB
- en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
- en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
st1 = "He went to play basketball"

# Process the text with the pipeline to get a Doc object
doc = nlp(st1)

In [None]:
nlp.pipe_names   #  active pipeline components

['tagger', 'parser', 'ner']

In [None]:
# disable the tagger and parser; the tokenizer always runs and 'ner' stays active
nlp.disable_pipes('tagger', 'parser')

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fd48b0fda50>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fd48b248830>)]

In [None]:
nlp.pipe_names


['ner']

### Part-of-Speech (POS) Tagging using spaCy
In English grammar, the parts of speech tell us *what function a word serves and how it is used in a sentence*. Some of the common parts of speech in English are *Noun, Pronoun, Adjective, Verb, Adverb*, etc.

POS tagging is the task of automatically assigning POS tags to all the words of a sentence. It is helpful in various downstream tasks in NLP, such as ***feature engineering, language understanding, and information extraction***.

POS tagging in spaCy takes only a few lines:

In [None]:
# reload the model so the tagger and parser disabled above are active again
nlp = spacy.load('en_core_web_sm')

# Process the text to get a Doc object
doc = nlp("He went to play basketball, he is Good in Play")
 
# Iterate over the tokens
for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)

He --> PRON
went --> VERB
to --> PART
play --> VERB
basketball --> NOUN
, --> PUNCT
he --> PRON
is --> AUX
Good --> ADJ
in --> ADP
Play --> NOUN


In [None]:
print("{} : {}".format("PART",spacy.explain("PART")))
print("{} : {}".format("VERB",spacy.explain("VERB")))
print("{} : {}".format("NOUN",spacy.explain("NOUN")))
print("{} : {}".format("PRON",spacy.explain("PRON")))
print("{} : {}".format("PUNCT",spacy.explain("PUNCT")))
print("{} : {}".format("AUX",spacy.explain("AUX")))
print("{} : {}".format("ADJ",spacy.explain("ADJ")))
print("{} : {}".format("ADP",spacy.explain("ADP")))

PART : particle
VERB : verb
NOUN : noun
PRON : pronoun
PUNCT : punctuation
AUX : auxiliary
ADJ : adjective
ADP : adposition
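
As a small illustration of the feature-engineering use mentioned earlier, the sketch below turns POS tags into relative-frequency features. It works on the `(token, tag)` pairs copied from the output above instead of calling spaCy, so the names `tagged`, `tag_counts` and `tag_features` are hypothetical helpers rather than part of any spaCy API.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above
tagged = [
    ("He", "PRON"), ("went", "VERB"), ("to", "PART"), ("play", "VERB"),
    ("basketball", "NOUN"), (",", "PUNCT"), ("he", "PRON"), ("is", "AUX"),
    ("Good", "ADJ"), ("in", "ADP"), ("Play", "NOUN"),
]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)
total = sum(tag_counts.values())

# Relative frequency of each tag, usable as a simple numeric feature
tag_features = {tag: count / total for tag, count in tag_counts.items()}

for tag, share in sorted(tag_features.items()):
    print(f"{tag}: {share:.2f}")
```

With a real `Doc`, the same pairs would come from `(token.text, token.pos_)` for each token.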


### Dependency Parsing using spaCy
Every sentence has a grammatical structure, and dependency parsing extracts it. The result can be viewed as a directed graph in which the nodes are the words of the sentence and the edges are the dependencies between those words.

Dependency parsing in spaCy is just as straightforward. We use the same sentence as for POS tagging:

In [None]:
# dependency parsing
for token in doc:
    print(token.text, "-->", token.dep_)
    print("{} : {}".format(token.dep_, spacy.explain(token.dep_)))
    print("---"*10)




He --> nsubj
nsubj : nominal subject
------------------------------
went --> ccomp
ccomp : clausal complement
------------------------------
to --> aux
aux : auxiliary
------------------------------
play --> advcl
advcl : adverbial clause modifier
------------------------------
basketball --> dobj
dobj : direct object
------------------------------
, --> punct
punct : punctuation
------------------------------
he --> nsubj
nsubj : nominal subject
------------------------------
is --> ROOT
ROOT : None
------------------------------
Good --> acomp
acomp : adjectival complement
------------------------------
in --> prep
prep : prepositional modifier
------------------------------
Play --> pobj
pobj : object of preposition
------------------------------
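
To make the directed-graph view concrete, here is a minimal pure-Python sketch that stores each dependency as an edge from a head word to its dependent and prints the resulting tree. The labels are copied from the output above, but the head of each token is an assumption made for illustration, because the output does not print it (in spaCy it is available as `token.head`).

```python
from collections import defaultdict

# (token, dependency label, head) triples. Labels come from the output above;
# the head column is an ASSUMPTION, since the output does not show token.head.
triples = [
    ("He", "nsubj", "went"),
    ("went", "ccomp", "is"),
    ("to", "aux", "play"),
    ("play", "advcl", "went"),
    ("basketball", "dobj", "play"),
    (",", "punct", "is"),
    ("he", "nsubj", "is"),
    ("is", "ROOT", None),
    ("Good", "acomp", "is"),
    ("in", "prep", "Good"),
    ("Play", "pobj", "in"),
]

# Adjacency list: head -> list of (dependent, label) edges
children = defaultdict(list)
root = None
for token, dep, head in triples:
    if head is None:
        root = token
    else:
        children[head].append((token, dep))

def render(node, depth=0, label="ROOT"):
    """Return the subtree under `node` as indented text lines."""
    lines = ["  " * depth + f"{node} ({label})"]
    for child, dep in children[node]:
        lines.extend(render(child, depth + 1, dep))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```

Using the word itself as the node key only works here because no word form repeats exactly; a real implementation would key on the token index instead.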


### Named Entity Recognition using spaCy
Entities are the words or groups of words that represent information about common things such as **persons, locations, organizations**, etc. These entities have proper names.

spaCy recognizes named entities in a sentence and exposes them through `doc.ents`:

In [None]:
st2 = "Indians spent over $71 billion on clothes in 2018, by PM Modi in India"


doc = nlp(st2)
 
for ent in doc.ents:
    print(ent.text, ent.label_)
    print("{} : {}".format(ent.label_, spacy.explain(ent.label_)))
    print("---"*10)

Indians NORP
NORP : Nationalities or religious or political groups
------------------------------
$71 billion MONEY
MONEY : Monetary values, including unit
------------------------------
2018 DATE
DATE : Absolute or relative dates or periods
------------------------------
PM Modi PERSON
PERSON : People, including fictional
------------------------------
India GPE
GPE : Countries, cities, states
------------------------------


### Rule-Based Matching using spaCy
Rule-based matching is a new addition to spaCy’s arsenal. With this spaCy matcher, you can find words and phrases in the text using user-defined rules.

While regular expressions use text patterns to find words and phrases, the spaCy matcher uses not only text patterns but also lexical properties of the words, such as POS tags, dependency tags, lemmas, etc.

In [None]:
nlp.vocab[0]

<spacy.lexeme.Lexeme at 0x7fd48a238c30>

In [None]:
# import spacy
# nlp = spacy.load('en_core_web_sm')

# Import spaCy Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)


st3 = "Some people start their day with lemon water"
st4 = "Some people start their day with lemon water, lemon water is soft drink."

doc = nlp(st3)
doc1 = nlp(st4)

# Define rule
pattern = [{'TEXT': 'lemon'}, {'TEXT': 'water'}]

# Add rule (spaCy v2 signature; in spaCy v3 use matcher.add('rule_1', [pattern]))
matcher.add('rule_1', None, pattern)

In the code above:

- First, we import the spaCy matcher.
- After that, we initialize the matcher object with the default spaCy vocabulary.
- Then, we pass the input text to the `nlp` object as usual.
- In the next step, we define the rule/pattern for what we want to extract from the text.

Let’s say we want to extract the phrase “lemon water” from the text. Our objective is that whenever “lemon” is followed by the word “water”, the matcher finds this pattern in the text. That is exactly what the pattern defined in the code above expresses. Finally, we add the defined rule to the matcher object.

In [None]:
matches = matcher(doc)
# Each match is a (match_id, start, end) tuple: the match ID, then the
# start and end token positions of the matched span (end is exclusive).
matches

[(7604275899133490726, 6, 8)]

In [None]:
# Extract matched text
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

lemon water


In [None]:
matches = matcher(doc1)  
matches

[(7604275899133490726, 6, 8), (7604275899133490726, 9, 11)]

In [None]:
# Extract matched text
for match_id, start, end in matches:
  print(match_id, start, end)
  # Get the matched span
  matched_span = doc1[start:end]
  print(matched_span.text)

7604275899133490726 6 8
lemon water
7604275899133490726 9 11
lemon water
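
To build intuition for the `(start, end)` positions shown above, the following pure-Python sketch scans a token list for a sequence of consecutive tokens. It is not how spaCy implements the matcher: `simple_tokenize` is a crude, hypothetical stand-in for spaCy's tokenizer and `find_token_sequence` only handles exact text matches.

```python
import re

def simple_tokenize(text):
    """Crude stand-in for a tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def find_token_sequence(tokens, wanted):
    """Return (start, end) spans where `wanted` occurs as consecutive tokens."""
    n = len(wanted)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == wanted
    ]

st4 = "Some people start their day with lemon water, lemon water is soft drink."
tokens = simple_tokenize(st4)
spans = find_token_sequence(tokens, ["lemon", "water"])

print(spans)
print([" ".join(tokens[start:end]) for start, end in spans])
```

For this sentence the spans agree with the matcher output above because the comma is split off as its own token, which shifts the second "lemon" to position 9.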


In the pattern above, ‘TEXT’ is a token attribute that means the exact text of the token. spaCy offers many other token attributes that can be used to define a variety of rules and patterns; see [https://spacy.io/usage/rule-based-matching](https://spacy.io/usage/rule-based-matching).
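
As one example of another token attribute, the sketch below uses `LOWER` to match "lemon water" regardless of casing. It assumes a spaCy version that accepts the list-of-patterns call `matcher.add(name, [pattern])` (the spaCy v3 signature, unlike the v2-style call used in the cells of this notebook), and it uses a blank English pipeline so no model download is needed. The variable names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline has only a tokenizer, which is enough for LOWER
nlp_blank = spacy.blank("en")
matcher_lower = Matcher(nlp_blank.vocab)

# LOWER compares the lowercased token text, so casing is ignored
pattern_lower = [{"LOWER": "lemon"}, {"LOWER": "water"}]

# List-of-patterns call style (assumed: spaCy v3)
matcher_lower.add("lemon_water", [pattern_lower])

doc_lower = nlp_blank("Lemon water is nice and lemon Water is cheap")
found = [doc_lower[start:end].text for _, start, end in matcher_lower(doc_lower)]
print(found)
```

Attributes such as `POS` or `LEMMA` need a trained pipeline, which is why the next cells use the loaded `nlp` object.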

In [None]:
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

pattern = [{'TEXT': 'book', 'POS': 'NOUN'}]  # match 'book' only when it is tagged as a noun

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
matcher.add('rule_2', None, pattern)

In [None]:
matches = matcher(doc1)
matches

[(375134486054924901, 3, 4)]

In [None]:
matches = matcher(doc2)  # 'book' is present but ignored because it is a verb here, not a noun
matches

[]
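
For contrast, a plain regular expression sees only characters, so it cannot separate the noun from the verb. This minimal sketch runs the same two sentences through `re.findall`; the names `sentences` and `hits` are only for illustration.

```python
import re

sentences = ["You read this book", "I will book my ticket"]

# The regex matches the word "book" in both sentences, noun or verb alike
hits = {s: re.findall(r"\bbook\b", s) for s in sentences}

for sentence, found in hits.items():
    print(f"{sentence!r} -> {found}")
```

Both sentences produce a hit, whereas the POS-aware matcher rule above matched only the first.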

In [None]:
# give your filename here
with open("file.txt", "r") as fp:
  text = fp.read()

In [None]:
text

' \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\nSharad S Sawant \nRESUME SAMPLE \n\nfrom Resume Genius \n\nEmail:  youremail@gmail.com \n\nPhone: 895 555  555 \n\nAddress: 4397 Aaron Smith  Drive \nHarrisburg, PA  17101 \n\nLinkedin:  linkedin.com/in/yourprole \n\nR E S U M E  O B J E CT I V E \n\nHuman Resources Generalist with 6+ years of experience assisting with and fulfilling organization staffing needs \n\nand requirements. Aiming to use my dynamic communication and organization skills to achieve your HR \n\ninitiatives. Possess a BA in Human Resources Management and a Professional in Human Resources certification. \n\n \n\n \nSKILL\n \nS \n \n\n90WPM Typing Speed  Workday \n\nKronos \n\nConflict Management \n\nMS Office Suite \n\nTeamwork \n\n \n\nLeadership \n\nTime Management \n\nInterpersonal Communication \n\nAdaptability. \n\nPublic Speaking \n\nEXPERIENCE \nHR G

In [None]:
doc = nlp(text)

In [None]:
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print("{}, {}({})".format(ent.text, ent.label_, spacy.explain(ent.label_)))

Aaron Smith, PERSON(People, including fictional)
Zip, PERSON(People, including fictional)


In [None]:
for ent in doc.ents:
    if ent.label_ == "GPE":
        print("{}, {}".format(ent.text, ent.label_))

India, GPE


In [None]:

doc = nlp(text)

# escape the dot before the domain suffix so it matches a literal "."
pattern = [{"TEXT": {"REGEX": r"[\w.-]+@[\w.-]+\.[\w.-]+"}}]
# add the rule before running the matcher; otherwise it cannot produce matches
matcher.add("email", None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    if string_id == "email":
        span = doc[start:end]
        print(match_id, string_id, start, end)
        print("Email ID : ", span.text)

7320900731437023467 email 15 16
Email ID :  youremail@gmail.com


In [None]:
matcher = Matcher(nlp.vocab)
# same rule on a fresh matcher, so no filtering by rule name is needed
pattern = [{"TEXT": {"REGEX": r"[\w.-]+@[\w.-]+\.[\w.-]+"}}]
matcher.add("email", None, pattern)
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end)
    print(span.text)

7320900731437023467 email 15 16
youremail@gmail.com


# Regular Expressions

In [None]:
import re

# insert your text here
text = "my email id is anaji17@gmail.com and old email ak@gmail.com"

re.findall(r"[\w.-]+@[\w.-]+", text)

['anaji17@gmail.com', 'ak@gmail.com']

In [None]:
# give your filename here
with open("file.txt", "r") as fp:
  text = fp.read()
  
re.findall(r"[\w.-]+@[\w.-]+", text)

['youremail@gmail.com']
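
One caveat of the pattern above is that `.` is allowed anywhere after the `@`, so a full stop at the end of a sentence is swallowed into the match. The sketch below shows this on a made-up sentence and compares it with a slightly stricter pattern that requires each dot to be followed by more characters. The stricter pattern is an illustrative variant, not a complete email validator.

```python
import re

# Made-up sentence; note the full stop right after the first address
text = "Write to anaji17@gmail.com. The old address was ak@gmail.com, now unused."

# Pattern from the cells above: the trailing "." is captured too
loose = re.findall(r"[\w.-]+@[\w.-]+", text)

# Variant: domain labels separated by dots, so a trailing "." is left out
strict = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", text)

print(loose)
print(strict)
```

The non-capturing group `(?:...)` keeps `re.findall` returning whole matches instead of group contents.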

## Regular Expressions for Web Scraping

In [None]:
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://en.wikipedia.org/wiki/Unsupervised_learning"

r = requests.get(URL)
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
type(soup)
# for text in soup.body.find_all(string=True):
#     if text.parent.name not in ['script', 'meta', 'link', 'style'] and not isinstance(text, Comment) and text != '\n':
#         print(text.strip())

# re.findall(r"https:+[\w.-]+.jpg+", soup)

bs4.BeautifulSoup

In [None]:
re.findall(r"\/wiki\/[\w-]*",  str(soup))[:5]

['/wiki/Unsupervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File']

In [None]:
re.findall(r">([\w\s()]*?)</a>", str(soup))[:4]

['', 'Jump to navigation', 'Jump to search', 'Machine learning']
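
The two patterns above can be combined with capture groups so that each link target is returned together with its visible text. The sketch below runs on a short hand-written HTML string instead of the downloaded page, so the `html` snippet and its links are hypothetical.

```python
import re

# Hypothetical snippet standing in for the downloaded page source
html = (
    '<p><a href="/wiki/Machine_learning">Machine learning</a> and '
    '<a href="/wiki/Data_mining">Data mining</a> are related to '
    '<a href="https://example.org/other">an external page</a>.</p>'
)

# First group: the /wiki/ link target; second group: the visible link text
links = re.findall(r'<a href="(/wiki/[\w-]*)">([\w\s()]*?)</a>', html)

for target, label in links:
    print(target, "->", label)
```

With more than one group, `re.findall` returns a list of tuples; the external link is skipped because its `href` does not start with `/wiki/`.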

## Working with Date-Time features

In [None]:
date = "2018-03-14 06:08:18"
re.findall(r"\d{4}", date)

['2018']

In [None]:
re.findall(r"(\d{4})-(\d{2})-(\d{2})", date)

[('2018', '03', '14')]

In [None]:
d = "12th September, 2019"

re.findall(r"(\d{2})\w+\s(\w+),\s(\d{4})", d)

[('12', 'September', '2019')]
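
The captured pieces become more useful once they are named and converted. The sketch below uses named groups for the first date and hands the second one to `datetime.strptime`. It assumes an English locale for the `%B` month name, and it relaxes the day to `\d{1,2}` so that single-digit days would also match; both are choices made for this illustration.

```python
import re
from datetime import datetime

date = "2018-03-14 06:08:18"

# Named groups make each captured part self-describing
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
parts = {name: int(value) for name, value in m.groupdict().items()}
print(parts)

d = "12th September, 2019"

# Drop the ordinal suffix ("th") and let strptime interpret the rest
day, month, year = re.findall(r"(\d{1,2})\w+\s(\w+),\s(\d{4})", d)[0]
parsed = datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
print(parsed.date())
```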

[https://docs.python.org/3/howto/regex.html](https://docs.python.org/3/howto/regex.html)

[http://www.rexegg.com/regex-quickstart.html#ref](http://www.rexegg.com/regex-quickstart.html#ref)

In [None]:
import re
import nltk

# download the stop-word list from nltk (needed once)
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # converting to lowercase
    newString = text.lower()
    # removing links
    newString = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', newString)
    # keeping alphabetic characters only
    newString = re.sub("[^a-zA-Z]", " ", newString)
    # removing stop words
    tokens = [w for w in newString.split() if w not in stop_words]
    # removing short words (fewer than 4 characters)
    long_words = [w for w in tokens if len(w) >= 4]
    return " ".join(long_words)

In [None]:
# step-by-step version of clean_text; the sample string is only illustrative
newString = "Read more about spaCy at https://spacy.io/usage".lower()

# removing links
newString = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', newString)

In [None]:
# keeping alphabetic characters only
newString = re.sub("[^a-zA-Z]", " ", newString)

In [None]:
# removing stop words
tokens = [w for w in newString.split() if w not in stop_words]

In [None]:
# removing short words (fewer than 4 characters)
long_words = [w for w in tokens if len(w) >= 4]
" ".join(long_words)
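
To see what each cleaning step contributes, here is a self-contained sketch of the same steps that returns the intermediate results. So that it runs without the NLTK download, it uses a tiny hand-written stop-word set as a stand-in for `stopwords.words('english')`; that set, the function name `clean_text_steps` and the sample sentence are assumptions for illustration.

```python
import re

# ASSUMPTION: tiny stand-in for the NLTK English stop-word list
STOP_WORDS = {"out", "for", "the", "on", "in"}

def clean_text_steps(text):
    """Return the intermediate result of each cleaning step."""
    lowered = text.lower()
    # remove links
    no_links = re.sub(r"(https|http)?://(\w|\.|/|\?|=|&|%)*\b", "", lowered)
    # keep alphabetic characters only
    letters_only = re.sub("[^a-zA-Z]", " ", no_links)
    # remove stop words
    tokens = [w for w in letters_only.split() if w not in STOP_WORDS]
    # remove words shorter than 4 characters
    long_words = [w for w in tokens if len(w) >= 4]
    return {
        "lowered": lowered,
        "no_links": no_links,
        "tokens": tokens,
        "cleaned": " ".join(long_words),
    }

steps = clean_text_steps(
    "Check out https://spacy.io/usage for the BEST tips on NLP in 2021!"
)
for name, value in steps.items():
    print(f"{name}: {value!r}")
```

Note that the length filter also drops short but meaningful words such as "nlp", so the threshold of 4 is worth revisiting for domain-specific text.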