<a href="https://colab.research.google.com/github/afzal34sl/Data-Science/blob/ML/SpaCy_NP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Extracting Noun Phrases with Adjectives where Nouns are singular**
We will use the spaCy library to build a custom noun chunker that extracts multiple adjectives for a single noun, leaving out punctuation and conjunctions, and returns the noun in its singular form.
The noun chunker can be adapted to meet different needs.

## Install all the libraries, if necessary

In [None]:
# ! pip install spacy #Main Library
# ! pip install watermark
# ! pip install contractions
# ! pip install numpy
# ! pip install pandas
# ! python -m spacy download en_core_web_lg

## Import all the libraries

In [None]:
import pandas as pd
import numpy as np
import spacy
import contractions


## Watermark

The watermark extension records the specifications of the computer used and the versions of the imported libraries.

In [None]:
%load_ext watermark
%watermark -a "Afzal Azeem Chowdhary" -u -d -v -m --iversions

Author: Afzal Azeem Chowdhary

Last updated: 2022-07-07

Python implementation: CPython
Python version       : 3.7.13
IPython version      : 5.5.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.4.188+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

contractions: 0.1.72
spacy       : 3.3.1
pandas      : 1.3.5
IPython     : 5.5.0
numpy       : 1.21.6



## The Input Text

For simplicity, we assign the text directly to a variable called 'txt'.

We first convert the whole text to lowercase so that the singular forms of the nouns come out consistently. Without this step, the same word can be handled in two different ways in the example text.
For example, 'Ramen' stays 'Ramen', but 'ramen' is converted to 'raman' (which the lemmatizer treats as the singular of 'ramen'). This step can be skipped if needed.



In [None]:
# Only the last assignment to txt takes effect; comment out the others to switch between example texts.
txt = str("""An apple a day keeps the doctor away. The doctor lives with his intelligent wife and their handsome son named Ramon. Ramon doesn't eat apples but he eats Ramen. Although Ramon agrees that green apples are better than hot, vibrant and spicy ramen.""")
# txt = str("""An apple a day keeps the doctor away. The doctor lives with his wife and their son named Ramon. Ramon doesn't eat apples but he eats Ramen. Although Ramon agrees that green apples are better than hot ramen.""")
txt = str("""A shocking video of a driver losing control on a wet road and smashing into another vehicle has surfaced online. 
The short clip was shared on Reddit. The caption of the post informed that the incident took place in Kasaragod in Kerala. The terrifying video shows a speeding car losing control on a wet road. The driver who recorded the incident was driving smoothly when suddenly a red car coming from the opposite direction rammed into it. """)
txt = txt.lower()


#### Using Contractions
We can also use the contractions library to expand forms such as 'doesn't' to 'does not' and 'isn't' to 'is not'. spaCy is robust enough to handle contractions on its own, so this step can be skipped if needed.


In [None]:
# txt = contractions.fix(txt)

## Data Formatting

Create a dataframe with one sentence per row by splitting the text on full stops.




In [None]:
df = pd.DataFrame(columns=['sentence'])
df['sentence'] = ([x for x in txt.split('.')])


Print the sentences, casting each dataframe cell to a string.




In [None]:
print(df['sentence'].astype(str))

0    a shocking video of a driver losing control on...
1                  the short clip was shared on reddit
2     the caption of the post informed that the inc...
3     the terrifying video shows a speeding car los...
4     the driver who recorded the incident was driv...
5                                                     
Name: sentence, dtype: object
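
Splitting on '.' leaves leading whitespace on most sentences and an empty last row, as the output above shows. The following is a minimal sketch of one way to tidy this up, not part of the original notebook: the helper name `split_sentences` is hypothetical, and the sample string is a shortened piece of the example text.

```python
def split_sentences(text):
    """Split on full stops, trim surrounding whitespace, drop empty fragments."""
    fragments = (part.strip() for part in text.split('.'))
    return [part for part in fragments if part]

# Shortened, lowercased piece of the example text (note the newline and trailing space).
sample = "the short clip was shared on reddit. \nthe caption of the post informed that the incident took place in kasaragod in kerala. "

sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

The resulting list can be assigned to `df['sentence']` in the same way as before. Splitting on a literal full stop still breaks on abbreviations such as 'Dr.', so this only suits simple text like the examples used here.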


## Load the Model

In [None]:
nlp = spacy.load('en_core_web_lg') #The last two characters can be changed to sm and md based on requirements.

## POS Tagging

We will iterate over each sentence and get the tags.

The function below runs a sentence through the spaCy pipeline, prints the attributes of each token, and returns a list of (text, tag) pairs.

In [None]:
def onegram(text):
    doc = nlp(text) # spaCy does not need to be told explicitly to tokenize and then tag; calling nlp runs the whole pipeline.
    result = []
    print("{0:10} {1:10} {2:8} {3:8} {4:8} {5:8} {6:8} {7:8}".format("text", "lemma_", "pos_", "tag_", "dep_",
            "shape_", "is_alpha", "is_stop"))

    for token in doc:
        print("{0:10} {1:10} {2:8} {3:8} {4:8} {5:8} {6:8} {7:8}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
        result.append((token.text, token.tag_))
    return result
df['sentence'].apply(onegram)

text       lemma_     pos_     tag_     dep_     shape_   is_alpha is_stop 
a          a          DET      DT       det      x               1        1
shocking   shocking   ADJ      JJ       amod     xxxx            1        0
video      video      NOUN     NN       nsubj    xxxx            1        0
of         of         ADP      IN       prep     xx              1        1
a          a          DET      DT       det      x               1        1
driver     driver     NOUN     NN       npadvmod xxxx            1        0
losing     lose       VERB     VBG      amod     xxxx            1        0
control    control    NOUN     NN       pobj     xxxx            1        0
on         on         ADP      IN       prep     xx              1        1
a          a          DET      DT       det      x               1        1
wet        wet        ADJ      JJ       amod     xxx             1        0
road       road       NOUN     NN       pobj     xxxx            1        0
and        a

0    [(a, DT), (shocking, JJ), (video, NN), (of, IN...
1    [( , _SP), (the, DT), (short, JJ), (clip, NN),...
2    [( , _SP), (the, DT), (caption, NN), (of, IN),...
3    [( , _SP), (the, DT), (terrifying, JJ), (video...
4    [( , _SP), (the, DT), (driver, NN), (who, WP),...
5                                           [( , _SP)]
Name: sentence, dtype: object
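
The (text, tag) pairs returned by `onegram` can be filtered directly on their Penn Treebank tags, where NN marks a singular noun, NNS a plural noun and JJ an adjective. The following is a minimal sketch, not part of the original notebook: the helper names `singular_nouns` and `adjacent_adjective_nouns` are hypothetical, and the pairs are copied by hand from the first rows of the table printed above so that the snippet runs without a model.

```python
# (text, tag) pairs copied from the first twelve rows of the printed table.
tagged = [
    ("a", "DT"), ("shocking", "JJ"), ("video", "NN"), ("of", "IN"),
    ("a", "DT"), ("driver", "NN"), ("losing", "VBG"), ("control", "NN"),
    ("on", "IN"), ("a", "DT"), ("wet", "JJ"), ("road", "NN"),
]

def singular_nouns(pairs):
    """Keep tokens tagged NN (singular noun); NNS (plural) is excluded."""
    return [text for text, tag in pairs if tag == "NN"]

def adjacent_adjective_nouns(pairs):
    """Join each adjective (JJ) with the singular noun (NN) directly after it."""
    return [
        f"{first} {second}"
        for (first, tag1), (second, tag2) in zip(pairs, pairs[1:])
        if tag1 == "JJ" and tag2 == "NN"
    ]

print(singular_nouns(tagged))
print(adjacent_adjective_nouns(tagged))
```

Matching on adjacency alone does not cover a phrase such as 'hot, vibrant and spicy ramen', where punctuation and a conjunction sit between the adjectives and the noun. That case is why the chunker in the next section walks the dependency tree instead.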

## Custom Noun Chunker

The built-in 'noun_chunks' property returns the noun phrases (NPs) of a document. However, it can miss nouns such as 'day' in the first example text. Since our requirements are different, we will create a custom noun chunker.

The custom noun chunker is 'extractNP', which calls a recursive function called 'check_children' to collect all the adjectives of a noun, as required in our case. Because the dependency parse arranges the tokens in a tree, a noun's adjectives, conjunctions and punctuation appear among its children and their descendants. We therefore loop through the children recursively to collect every adjective of a noun.

In [None]:
def check_children(token):
    """Recursively collect the adjectives below a token as 'adj1 adj2 '."""
    chunk = ''
    for w in token.children:
        # Keep adjectives only; conjunctions (CCONJ) and punctuation (PUNCT) are skipped.
        if w.pos_ == 'ADJ':
            chunk = chunk + w.text + ' ' + check_children(w)
    return chunk

def extractNP(text):
    """Return one phrase per noun: its adjectives followed by the noun's lemma."""
    doc = nlp(text)
    result = []
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN'):
            # The lemma gives the singular form of the noun.
            chunk = check_children(token) + token.lemma_
            if chunk:
                result.append(chunk)
    return result
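
To see why `check_children` has to be recursive, the traversal can be tried on a hand-built tree without loading a model. The following is a minimal sketch: the `Node` class and the function `collect_adjectives` are hypothetical stand-ins for a spaCy token and `check_children`, and the way the coordinated adjectives are attached is an assumption about how a parse of 'hot, vibrant and spicy ramen' might look. The real parse depends on the model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    pos: str
    lemma: str = ""
    children: List["Node"] = field(default_factory=list)

def collect_adjectives(node):
    """Same traversal as check_children: keep ADJ children and recurse into them."""
    chunk = ""
    for child in node.children:
        if child.pos == "ADJ":
            chunk += child.text + " " + collect_adjectives(child)
    return chunk

# Assumed tree: 'hot' modifies 'ramen'; 'vibrant' hangs off 'hot' and
# 'spicy' hangs off 'vibrant', next to the comma and the conjunction.
spicy = Node("spicy", "ADJ")
vibrant = Node("vibrant", "ADJ", children=[Node("and", "CCONJ"), spicy])
hot = Node("hot", "ADJ", children=[Node(",", "PUNCT"), vibrant])
ramen = Node("ramen", "NOUN", lemma="ramen", children=[hot])

phrase = collect_adjectives(ramen) + ramen.lemma
print(phrase)
```

In this assumed tree only 'hot' is a direct child of the noun, so a single, non-recursive pass over the noun's children would return 'hot ramen' and lose the other two adjectives.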

### Display the result after converting back to list

In [None]:
df['sentence'].apply(extractNP).tolist()

[['shocking video', 'driver', 'control', 'wet road', 'vehicle'],
 ['short clip', 'reddit'],
 ['caption', 'post', 'incident', 'place', 'kasaragod', 'kerala'],
 ['terrifying video', 'car', 'control', 'wet road'],
 ['driver', 'incident', 'red car', 'opposite direction'],
 []]

#**All the code is collected into a single cell to run at once, except for the text cell.**

In [None]:
txt = str("""An apple a day keeps the doctor away. The doctor lives with his intelligent wife and their handsome son named Ramon. Ramon doesn't eat apples but he eats Ramen. Although Ramon agrees that green apples are better than hot, vibrant and spicy ramen.""")
# txt = str("""An apple a day keeps the doctor away. The doctor lives with his wife and their son named Ramon. Ramon doesn't eat apples but he eats Ramen. Although Ramon agrees that green apples are better than hot ramen.""")
txt = str("""A shocking video of a driver losing control on a wet road and smashing into another vehicle has surfaced online. 
The short clip was shared on Reddit. The caption of the post informed that the incident took place in Kasaragod in Kerala. The terrifying video shows a speeding car losing control on a wet road. The driver who recorded the incident was driving smoothly when suddenly a red car coming from the opposite direction rammed into it. """)
# txt = txt.replace("\n","")
txt = txt.lower()

In [None]:
# ! pip install spacy #Main Library
# ! pip install watermark
# ! pip install contractions
# ! pip install numpy
# ! pip install pandas
# ! python -m spacy download en_core_web_lg

#Importing and Loading libraries
import pandas as pd
import numpy as np
import spacy
import contractions

%load_ext watermark
%watermark -a "Afzal Azeem Chowdhary" -u -d -v -m --iversions

#Working with Data
df = pd.DataFrame(columns=['sentence'])
df['sentence'] = ([x for x in txt.split('.')])

df['sentence'].astype(str)

#Loading the model
nlp = spacy.load('en_core_web_lg') #The last two characters can be changed to sm and md based on requirements, but they need to be installed as well.

#POS Tagging
def onegram(text):
    doc = nlp(text) # spaCy does not need to be told explicitly to tokenize and then tag; calling nlp runs the whole pipeline.
    result = []
    # print("{0:10} {1:10} {2:8} {3:8} {4:8} {5:8} {6:8} {7:8}".format("text", "lemma_", "pos_", "tag_", "dep_",
    #         "shape_", "is_alpha", "is_stop"))

    # for token in doc:
    #     print("{0:10} {1:10} {2:8} {3:8} {4:8} {5:8} {6:8} {7:8}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
    #         token.shape_, token.is_alpha, token.is_stop))
    #     result.append((token.text, token.tag_))
    return result

df['sentence'].apply(onegram)


#Chunking NPs
def check_children(token):
    """Recursively collect the adjectives below a token as 'adj1 adj2 '."""
    chunk = ''
    for w in token.children:
        # Keep adjectives only; conjunctions (CCONJ) and punctuation (PUNCT) are skipped.
        if w.pos_ == 'ADJ':
            chunk = chunk + w.text + ' ' + check_children(w)
    return chunk

def extractNP(text):
    """Return one phrase per noun: its adjectives followed by the noun's lemma."""
    doc = nlp(text)
    result = []
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN'):
            # The lemma gives the singular form of the noun.
            chunk = check_children(token) + token.lemma_
            if chunk:
                result.append(chunk)
    return result

#Displaying Result
df['sentence'].apply(extractNP).tolist()

[['shocking video', 'driver', 'control', 'wet road', 'vehicle'],
 ['short clip', 'reddit'],
 ['caption', 'post', 'incident', 'place', 'kasaragod', 'kerala'],
 ['terrifying video', 'car', 'control', 'wet road'],
 ['driver', 'incident', 'red car', 'opposite direction'],
 []]