# Text Preprocessing Part 2

## 1. Part-of-Speech (POS) Tagging
Part-of-speech (POS) tagging is the process of labeling each word in a sentence with its grammatical role, such as noun, verb, or adjective. POS tagging is a building block for many natural language processing (NLP) tasks, including information retrieval, sentiment analysis, and machine translation.

## Case Study: POS Tagging in Python
### Problem Statement
Suppose we are given a sentence and want to tag each word with its part of speech automatically. We'll use Python and the open-source spaCy library to perform POS tagging and inspect the result.

In [None]:
# Install spaCy and download the small English model
!pip install spacy
!python -m spacy download en_core_web_sm



In [2]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The quick brown fox jumps over the lazy dog.")


In [4]:
for token in doc:
    print(token.text, token.pos_)

The DET
quick ADJ
brown ADJ
fox NOUN
jumps VERB
over ADP
the DET
lazy ADJ
dog NOUN
. PUNCT


## Understanding the POS Tags

#### Each output line pairs a word with its coarse-grained POS tag (`token.pos_`). Here’s what the tags represent:

##### 1. DET: Determiner (e.g., "the")
##### 2. ADJ: Adjective (e.g., "quick", "brown", "lazy")
##### 3. NOUN: Noun (e.g., "fox", "dog")
##### 4. VERB: Verb (e.g., "jumps"); the fine-grained tag in `token.tag_` for this form would be VBZ, "verb, 3rd person singular present"
##### 5. ADP: Adposition, i.e. a preposition or postposition (e.g., "over")
##### 6. PUNCT: Punctuation (e.g., ".")

### Case Study Scenario: POS Tagging in a Text Processing Pipeline
Imagine you’re building a sentiment analysis model for customer reviews. Before analyzing the text, POS tagging can help:

#### Extract adjectives to detect sentiment-heavy words.
#### Identify verbs and nouns to understand the actions and objects in customer feedback.
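
The two uses above come down to filtering tokens by their POS tag. The following is a minimal sketch of that step, assuming the tagger output has already been collected as `(word, tag)` pairs; the helper name `group_by_pos` is made up for illustration, and the pairs are copied from the output printed earlier rather than produced by running the model.

```python
from collections import defaultdict

# (word, coarse POS tag) pairs, copied from the tagger output shown earlier.
# With a real spaCy Doc these would be [(t.text, t.pos_) for t in doc].
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"), (".", "PUNCT"),
]

def group_by_pos(pairs, keep=("ADJ", "NOUN", "VERB")):
    """Collect words under their POS tag, keeping only the tags of interest."""
    groups = defaultdict(list)
    for word, pos in pairs:
        if pos in keep:
            groups[pos].append(word)
    return dict(groups)

groups = group_by_pos(tagged)
print(groups["ADJ"])   # adjectives: candidate sentiment words
print(groups["NOUN"])  # nouns: candidate objects
print(groups["VERB"])  # verbs: candidate actions
```

In a review pipeline, the adjective list would typically feed a sentiment lexicon lookup, while nouns and verbs hint at what the customer is talking about.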

## 2. Dependency Parsing
Dependency parsing analyzes the grammatical structure of a sentence by linking each word to the "head" word it depends on and labeling the relation between them. The words attached to a head are called its dependents. The resulting tree shows how the words in a sentence relate to one another and gives insight into its syntactic structure.
#### For example, in the sentence "She read a book", the subject "She" depends on the verb "read", and the object "book" also depends on the verb "read". Dependency parsing reveals this kind of relationship.

In [7]:
### Parse a sentence
# Sample sentence
sentence = "She read a book."

# Parse the sentence using the spaCy model
doc = nlp(sentence)

#### Visualize Dependencies Using spaCy’s Dependency Parser
Now, let's visualize the dependency tree using spaCy's displacy module, which provides an easy way to display syntactic dependencies.

In [8]:
from spacy import displacy

# Display the dependency tree
displacy.render(doc, style='dep', jupyter=True)

#### In this visualization:

Arrows indicate the direction of dependency between words.
Labels on the arrows describe the type of grammatical relationship (e.g., nsubj, dobj, etc.).

In [9]:
### Analyze Dependency Relations Programmatically
for token in doc:
    print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}, POS: {token.pos_}")

Token: She, Head: read, Dependency: nsubj, POS: PRON
Token: read, Head: read, Dependency: ROOT, POS: VERB
Token: a, Head: book, Dependency: det, POS: DET
Token: book, Head: read, Dependency: dobj, POS: NOUN
Token: ., Head: read, Dependency: punct, POS: PUNCT


#### Explanation:

##### Token: The word being processed.
##### Head: The word this token is syntactically dependent on (head word).
##### Dependency: The type of grammatical relation (e.g., nsubj for subject, dobj for direct object).
##### POS: Part-of-speech tag (e.g., VERB for verb, NOUN for noun).

###### In the sentence "She read a book.":

##### "She" is the subject (nsubj) of the verb "read".
##### "book" is the direct object (dobj) of the verb "read".
##### "a" is a determiner (det) modifying "book".
##### The verb "read" is the root of the sentence (ROOT).

## Understanding Dependency Tags
##### Here are some common dependency tags:

### nsubj: Nominal subject (e.g., the subject of a verb)
### dobj: Direct object (e.g., the object that the verb is acting upon)
### det: Determiner (e.g., "a", "the")
### ROOT: Root of the sentence (usually the main verb)
### amod: Adjectival modifier (e.g., adjectives that modify nouns)
### prep: Prepositional modifier (e.g., a preposition such as "over" that attaches a phrase to its head)
### pobj: Object of a preposition (e.g., the noun following a preposition)

In [10]:
## Parsing a More Complex Sentence
complex_sentence = "The quick brown fox jumps over the lazy dog."
doc_complex = nlp(complex_sentence)

for token in doc_complex:
    print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}, POS: {token.pos_}")


Token: The, Head: fox, Dependency: det, POS: DET
Token: quick, Head: fox, Dependency: amod, POS: ADJ
Token: brown, Head: fox, Dependency: amod, POS: ADJ
Token: fox, Head: jumps, Dependency: nsubj, POS: NOUN
Token: jumps, Head: jumps, Dependency: ROOT, POS: VERB
Token: over, Head: jumps, Dependency: prep, POS: ADP
Token: the, Head: dog, Dependency: det, POS: DET
Token: lazy, Head: dog, Dependency: amod, POS: ADJ
Token: dog, Head: over, Dependency: pobj, POS: NOUN
Token: ., Head: jumps, Dependency: punct, POS: PUNCT


### Explanation:

#### "fox" is the subject of the verb "jumps" (indicated by nsubj).
#### "dog" is the object of the preposition "over" (pobj).
#### The adjectives "quick" and "brown" are adjectival modifiers (amod) of "fox".
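
Because every token points to exactly one head, the printed rows are enough to walk the tree. The sketch below assumes the rows are stored as `(token, head index, label)` tuples transcribed from the output above; the helper names `path_to_root` and `dependents` are illustrative and not part of spaCy.

```python
# (token, index of its head, dependency label), transcribed from the output above.
# With a real spaCy Doc: [(t.text, t.head.i, t.dep_) for t in doc_complex].
rows = [
    ("The", 3, "det"), ("quick", 3, "amod"), ("brown", 3, "amod"),
    ("fox", 4, "nsubj"), ("jumps", 4, "ROOT"), ("over", 4, "prep"),
    ("the", 8, "det"), ("lazy", 8, "amod"), ("dog", 5, "pobj"),
    (".", 4, "punct"),
]

def path_to_root(rows, index):
    """Follow head links from a token up to the root, returning the words visited."""
    path = [rows[index][0]]
    while rows[index][2] != "ROOT":
        index = rows[index][1]
        path.append(rows[index][0])
    return path

def dependents(rows, head_index):
    """Return the words whose head is the token at head_index (the root excludes itself)."""
    return [word for i, (word, head, _) in enumerate(rows)
            if head == head_index and i != head_index]

print(path_to_root(rows, 7))  # ['lazy', 'dog', 'over', 'jumps']
print(dependents(rows, 3))    # ['The', 'quick', 'brown']
```

Indices are used instead of token text so that repeated words, such as "The" and "the" in a case-insensitive setting, cannot be confused with one another.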

In [11]:
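## Extracting Dependencies
# `doc` still holds the parse of "She read a book." from above.
# Default to None so the print below works even if a role is missing.
subject = verb = obj = None
for token in doc:
    if token.dep_ == "nsubj":
        subject = token.text
    elif token.dep_ == "dobj":
        obj = token.text
    elif token.dep_ == "ROOT":
        verb = token.text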

print(f"Subject: {subject}, Verb: {verb}, Object: {obj}")


Subject: She, Verb: read, Object: book


### Dependency Parsing for Advanced NLP Tasks
##### Dependency parsing is often used for:

##### Information extraction: Extracting relations between entities like subject-verb-object triplets.
##### Question answering: Understanding the syntactic structure to interpret queries.
##### Text summarization: Understanding sentence structure to compress text meaningfully.
##### Coreference resolution: Identifying which words refer to the same entity.
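
The information-extraction use case can be sketched as a small function that pairs each verb's subject with its direct object. This is an illustrative sketch, not spaCy functionality: `extract_svo` is a hypothetical helper, and the rows are transcribed by hand from the two parses printed earlier.

```python
# (token, index of its head, dependency label), transcribed from the earlier outputs.
# With a real spaCy Doc: [(t.text, t.head.i, t.dep_) for t in doc].
simple_rows = [
    ("She", 1, "nsubj"), ("read", 1, "ROOT"), ("a", 3, "det"),
    ("book", 1, "dobj"), (".", 1, "punct"),
]
complex_rows = [
    ("The", 3, "det"), ("quick", 3, "amod"), ("brown", 3, "amod"),
    ("fox", 4, "nsubj"), ("jumps", 4, "ROOT"), ("over", 4, "prep"),
    ("the", 8, "det"), ("lazy", 8, "amod"), ("dog", 5, "pobj"),
    (".", 4, "punct"),
]

def extract_svo(rows):
    """Return (subject, verb, object) triples for heads that have both an nsubj and a dobj."""
    subjects, objects = {}, {}
    for word, head, dep in rows:
        if dep == "nsubj":
            subjects[head] = word
        elif dep == "dobj":
            objects[head] = word
    return [(subjects[h], rows[h][0], objects[h])
            for h in sorted(subjects) if h in objects]

print(extract_svo(simple_rows))   # [('She', 'read', 'book')]
print(extract_svo(complex_rows))  # [] because "jumps" has no direct object
```

Keying on the head index rather than on `ROOT` lets the same function pick up triples from subordinate clauses as well, at the cost of ignoring verbs that take no direct object.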

## 3. Named Entity Recognition (NER)
Named Entity Recognition is a Natural Language Processing (NLP) task that involves identifying and classifying named entities in text into predefined categories such as "Person", "Organization", "Location", "Date", etc. NER is widely used in various applications like information extraction, question answering, and text summarization.

In [12]:
# Sample text for NER
text = "Apple is looking at buying U.K. startup for $1 billion in 2024."
# Process the text
doc = nlp(text)
# Print named entities with their labels
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: Apple, Label: ORG
Entity: U.K., Label: GPE
Entity: $1 billion, Label: MONEY
Entity: 2024, Label: DATE


#### Explanation:

#### Apple is recognized as an Organization (ORG).
#### U.K. is classified as a Geopolitical Entity (GPE).
#### $1 billion is recognized as Money (MONEY).
#### 2024 is classified as a Date (DATE).

#### Understand Named Entity Labels
##### spaCy uses several predefined labels for named entities. Some common ones are:

##### PERSON: People, including fictional.
##### ORG: Companies, agencies, institutions.
##### GPE: Countries, cities, states.
##### DATE: Dates or periods.
##### MONEY: Monetary values.
##### LOC: Non-GPE locations, mountain ranges, bodies of water.

In [13]:
# Look up the description of an entity label
from spacy import displacy

# spacy.explain returns a short description of a label
print(spacy.explain("GPE"))  # Countries, cities, states
print(spacy.explain("ORG"))  # Companies, agencies, institutions, etc.

Countries, cities, states
Companies, agencies, institutions, etc.


In [14]:
# Visualize named entities in the text
displacy.render(doc, style="ent", jupyter=True)


In [15]:
# Applying NER to a Larger Document
# Longer text example
long_text = """
Google, headquartered in Mountain View, unveiled the new Pixel at an event in San Francisco.
CEO Sundar Pichai said the company is committed to bringing AI technology to everyone.
The device is priced at $799 and will be available starting November 2024.
"""

# Process the text for NER
doc = nlp(long_text)

# Extract named entities
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")


Entity: Google, Label: ORG
Entity: Mountain View, Label: GPE
Entity: Pixel, Label: PERSON
Entity: San Francisco, Label: GPE
Entity: Sundar Pichai, Label: PERSON
Entity: AI, Label: ORG
Entity: 799, Label: MONEY
Entity: November 2024, Label: DATE
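
On longer documents it is usually easier to review entities grouped by label than as a flat list. The sketch below assumes the entities have been collected as `(text, label)` pairs, here copied from the output above; `entities_by_label` is a name chosen for illustration.

```python
from collections import defaultdict

# (entity text, label) pairs, copied from the NER output above.
# With a real spaCy Doc: [(ent.text, ent.label_) for ent in doc.ents].
entities = [
    ("Google", "ORG"), ("Mountain View", "GPE"), ("Pixel", "PERSON"),
    ("San Francisco", "GPE"), ("Sundar Pichai", "PERSON"), ("AI", "ORG"),
    ("799", "MONEY"), ("November 2024", "DATE"),
]

def entities_by_label(pairs):
    """Group entity texts under their label, preserving document order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = entities_by_label(entities)
for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Grouping also makes questionable predictions easier to spot: in this output the small model labels the product "Pixel" as PERSON and "AI" as ORG, so the results should be checked before being used downstream.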


## Case Study: Financial Reporting

In [16]:
def analyze_financial_report(text):
    # Step 1: Process the text using spaCy NLP pipeline
    doc = nlp(text)
    
    # Step 2: Named Entity Recognition (NER)
    print("Named Entity Recognition (NER):")
    for ent in doc.ents:
        print(f"Entity: {ent.text}, Label: {ent.label_}")
    print("\n")
    
    # Step 3: Part-of-Speech (POS) Tagging
    print("Part-of-Speech (POS) Tagging:")
    for token in doc:
        print(f"Token: {token.text}, POS: {token.pos_}, Tag: {token.tag_}, Lemma: {token.lemma_}")
    print("\n")
    
    # Step 4: Dependency Parsing
    print("Dependency Parsing:")
    for token in doc:
        print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}, POS: {token.pos_}")
    print("\n")
    
    # Optional: Visualize dependency parsing and named entities in Jupyter Notebooks
    print("Visualizing Named Entities and Dependency Parsing...\n")
    displacy.render(doc, style="ent", jupyter=True)  # NER visualization
    displacy.render(doc, style="dep", jupyter=True)  # Dependency parsing visualization


In [17]:
financial_text = """
In Q3 2023, Apple Inc. reported a net income of $25 billion, a 5% increase compared to Q3 2022. 
The company attributed the growth to strong sales of the iPhone 15 and MacBook Pro. 
Additionally, Microsoft Corporation announced a partnership with Apple to expand cloud services. 
The partnership is valued at $10 billion and is expected to boost both companies' market shares.
"""


In [18]:
analyze_financial_report(financial_text)


Named Entity Recognition (NER):
Entity: Q3 2023, Label: DATE
Entity: Apple Inc., Label: ORG
Entity: $25 billion, Label: MONEY
Entity: 5%, Label: PERCENT
Entity: Q3 2022, Label: PERSON
Entity: MacBook Pro, Label: PERSON
Entity: Microsoft Corporation, Label: ORG
Entity: Apple, Label: ORG
Entity: $10 billion, Label: MONEY


Part-of-Speech (POS) Tagging:
Token: 
, POS: SPACE, Tag: _SP, Lemma: 

Token: In, POS: ADP, Tag: IN, Lemma: in
Token: Q3, POS: PROPN, Tag: NNP, Lemma: Q3
Token: 2023, POS: NUM, Tag: CD, Lemma: 2023
Token: ,, POS: PUNCT, Tag: ,, Lemma: ,
Token: Apple, POS: PROPN, Tag: NNP, Lemma: Apple
Token: Inc., POS: PROPN, Tag: NNP, Lemma: Inc.
Token: reported, POS: VERB, Tag: VBD, Lemma: report
Token: a, POS: DET, Tag: DT, Lemma: a
Token: net, POS: ADJ, Tag: JJ, Lemma: net
Token: income, POS: NOUN, Tag: NN, Lemma: income
Token: of, POS: ADP, Tag: IN, Lemma: of
Token: $, POS: SYM, Tag: $, Lemma: $
Token: 25, POS: NUM, Tag: CD, Lemma: 25
Token: billion, POS: NUM, Tag: CD, Lemma: bill
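
In the NER output above, "Q3 2023" is labeled DATE while "Q3 2022" is labeled PERSON. One possible way to handle such domain-specific misses is a small rule applied after the model. The sketch below is an assumption for illustration, not part of the original pipeline: the `QUARTER` pattern and the `relabel_quarters` helper are hypothetical, and the entity pairs are transcribed from the output above.

```python
import re

# (entity text, label) pairs transcribed from the NER output above.
entities = [
    ("Q3 2023", "DATE"), ("Apple Inc.", "ORG"), ("$25 billion", "MONEY"),
    ("5%", "PERCENT"), ("Q3 2022", "PERSON"), ("MacBook Pro", "PERSON"),
    ("Microsoft Corporation", "ORG"), ("Apple", "ORG"), ("$10 billion", "MONEY"),
]

# Hypothetical rule: "Q1".."Q4" followed by a four-digit year is a fiscal quarter.
QUARTER = re.compile(r"Q[1-4] \d{4}")

def relabel_quarters(pairs):
    """Force the DATE label on entities that look like fiscal quarters."""
    return [(text, "DATE" if QUARTER.fullmatch(text) else label)
            for text, label in pairs]

fixed = relabel_quarters(entities)
print([pair for pair in fixed if pair[1] == "DATE"])
# [('Q3 2023', 'DATE'), ('Q3 2022', 'DATE')]
```

The rule only covers quarter expressions; "MacBook Pro" is still labeled PERSON afterwards and would need its own rule or a larger model.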

## Case Study 2: Feature-Opinion Pair Extraction from Customer Reviews

### Objective:
The goal of the project is to extract specific product features (like battery life, screen quality, etc.) from customer reviews, along with the adjectives (opinions) that describe them, and identify the relationships between the features and the opinions using dependency parsing.

### Steps:
#### Part-of-Speech Tagging will help us identify nouns (features) and adjectives (opinions).
#### Dependency Parsing will help us determine the grammatical relationship between features and their associated opinions (e.g., “battery life” is modified by “great”).

In [29]:
import spacy
from spacy import displacy

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

def extract_feature_opinion_pairs(review_text):
    # Process the review text using spaCy
    doc = nlp(review_text)
    
    # Create a list to store feature-opinion pairs
    feature_opinion_pairs = []
    
    # Loop through each token to identify nouns (features) and their modifiers (adjectives)
    for token in doc:
        # Look for adjectives linked to a noun, either directly (amod)
        # or through a linking verb such as "is" (acomp/attr)
        if token.pos_ == 'ADJ':
            if token.dep_ == 'amod':  # Adjectival modifier: the head is the noun
                feature = token.head
                feature_opinion_pairs.append((feature.text, token.text))
            elif token.dep_ in ['attr', 'acomp']:  # Attribute or adjectival complement of a verb
                # The head is the verb; its left children include the subject,
                # which is likely the product feature noun
                for possible_noun in token.head.lefts:
                    if possible_noun.pos_ == 'NOUN' or possible_noun.pos_ == 'PROPN':
                        feature_opinion_pairs.append((possible_noun.text, token.text))
    
    # Print the extracted feature-opinion pairs
    print("Feature-Opinion Pairs:")
    for feature, opinion in feature_opinion_pairs:
        print(f"Feature: {feature}, Opinion: {opinion}")

    # Optional: Visualize the dependency parsing (for Jupyter)
    print("\nVisualizing Dependency Parsing...")
    displacy.render(doc, style='dep', jupyter=True)

    return feature_opinion_pairs


In [30]:
# Sample reviews
review_1 = "The battery life is amazing and the camera quality is excellent."
review_2 = "I love the sleek design, but the screen resolution could be better."
review_3 = "The performance is fast, but the storage space is too limited."


In [31]:
# Analyze review 1
print("Review 1:")
extract_feature_opinion_pairs(review_1)



Review 1:
Feature-Opinion Pairs:
Feature: life, Opinion: amazing
Feature: quality, Opinion: excellent

Visualizing Dependency Parsing...


[('life', 'amazing'), ('quality', 'excellent')]

In [32]:
# Analyze review 2
print("Review 2:")
extract_feature_opinion_pairs(review_2)


Review 2:
Feature-Opinion Pairs:
Feature: design, Opinion: sleek
Feature: resolution, Opinion: better

Visualizing Dependency Parsing...


[('design', 'sleek'), ('resolution', 'better')]

In [33]:
# Analyze review 3
print("Review 3:")
extract_feature_opinion_pairs(review_3)


Review 3:
Feature-Opinion Pairs:
Feature: performance, Opinion: fast
Feature: space, Opinion: limited

Visualizing Dependency Parsing...


[('performance', 'fast'), ('space', 'limited')]
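
The pairs above report only the head noun ("life", "quality", "space") rather than the full feature name ("battery life"). One way to recover the full name is to prepend the tokens attached to the noun with the `compound` relation. The sketch below is hedged accordingly: the parse is written by hand as an assumption about what the model would return for review 1 (the notebook does not print it), and `full_feature_name` is a hypothetical helper.

```python
# Assumed parse of "The battery life is amazing", written by hand as
# (token, index of its head, dependency label). The notebook does not print
# this parse; with a real Doc, build it as [(t.text, t.head.i, t.dep_) for t in doc].
rows = [
    ("The", 2, "det"), ("battery", 2, "compound"), ("life", 3, "nsubj"),
    ("is", 3, "ROOT"), ("amazing", 3, "acomp"),
]

def full_feature_name(rows, noun_index):
    """Prefix a head noun with the 'compound' tokens attached to it, in sentence order."""
    parts = [word for i, (word, head, dep) in enumerate(rows)
             if head == noun_index and dep == "compound" and i < noun_index]
    return " ".join(parts + [rows[noun_index][0]])

print(full_feature_name(rows, 2))  # battery life
```

Inside `extract_feature_opinion_pairs`, the same idea would apply to the noun found for each adjective before the pair is appended.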