**Name: Arti Yadav**

**PRN Number: 24070149029**


---



#**NLP Assignment 02: Named Entity Recognition & Regular Expressions in NLP**



---



**Named Entity Recognition:**
NER is a subtask of information extraction. It identifies and classifies the named entities in text into predefined categories such as:


* People (e.g., Barack Obama, Elon Musk)
* Organizations (e.g., Google, Microsoft)
* Locations (e.g., New York, Paris)
* Dates and Times (e.g., January 1, 2025, 3 PM)
* Monetary values (e.g., $500, 50 euros)

NER uses machine learning models, rule-based systems, or a hybrid of the two to identify these entities.
It is useful for tasks like document summarization, question answering, and chatbots.

Example:

1. Input text: "Barack Obama was born in Hawaii on August 4, 1961."

NER Output:
* Person: Barack Obama
* Location: Hawaii
* Date: August 4, 1961


**Regular Expressions (RegEx)**

Regular expressions (RegEx) are patterns used to search for specific sequences of characters in text. They are a powerful tool for text search, validation, and manipulation. RegEx doesn't understand the meaning of words; it simply matches patterns, which makes it well suited to simpler tasks like validating email addresses and phone numbers or extracting certain keywords or data.

For example, RegEx could be used to extract dates in a specific format like MM/DD/YYYY from a body of text.

**Example:**

Pattern: \d{2}/\d{2}/\d{4} (this matches dates in the format MM/DD/YYYY)

Input text: "The event is scheduled for 12/15/2025."

RegEx match: 12/15/2025
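As a minimal sketch, the same pattern can be applied with Python's `re` module. The variable names here are illustrative and not part of the assignment.

```python
import re

# Two digits, slash, two digits, slash, four digits (MM/DD/YYYY shape)
date_pattern = r"\d{2}/\d{2}/\d{4}"

text = "The event is scheduled for 12/15/2025."
match = re.search(date_pattern, text)
print(match.group())  # 12/15/2025

# A single-digit day does not fit the fixed two-digit shape, so nothing is found
no_match = re.search(date_pattern, "The event is scheduled for 12/5/2025.")
print(no_match)  # None
```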

**Comparing NER and RegEx:**

* NER is context-aware and semantic in nature. It uses the surrounding language to recognize entities even when they are expressed in varied ways (e.g., "New York City" vs. "NYC").

* RegEx is purely syntactic. It requires a fixed pattern to match and doesn't account for meaning or context beyond the structure of the text.
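To make the "purely syntactic" point concrete, the sketch below finds every MM/DD/YYYY-shaped string and then checks each candidate against the calendar with `datetime.strptime`. The helper names `find_date_like` and `is_real_date` are hypothetical. The regex accepts an impossible date because it only looks at the shape; the calendar check is what rejects it.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def find_date_like(text):
    """Return every substring that has the MM/DD/YYYY shape."""
    return DATE_PATTERN.findall(text)

def is_real_date(candidate):
    """Return True only if the candidate is a valid calendar date."""
    try:
        datetime.strptime(candidate, "%m/%d/%Y")
        return True
    except ValueError:
        return False

text = "Valid: 12/15/2025. Shaped like a date but impossible: 13/45/2025."
candidates = find_date_like(text)
real_dates = [c for c in candidates if is_real_date(c)]

print(candidates)  # ['12/15/2025', '13/45/2025']
print(real_dates)  # ['12/15/2025']
```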

**When to use NER vs. RegEx**

* Use NER when you need to identify and classify entities in text whose format or structure can vary. For example, when summarizing news articles or extracting data from unstructured text.

* Use RegEx when you need to search for, validate, or extract patterns from text that follow a specific, predictable format. For example, validating phone numbers, email addresses, or extracting dates in a uniform format.

**Combining NER and RegEx**

* NER could be used to extract named entities like organizations, and RegEx could be used to extract specific patterns (e.g., dates, phone numbers) from those named entities.
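One way to picture the combination is sketched below. The entity list is hand-written to mirror the Barack Obama example earlier in this document; in practice it would come from an NER model. The function name `years_from_date_entities` and the "Date" label string are assumptions made for illustration.

```python
import re

# A four-digit number standing on its own, treated here as a year
YEAR_PATTERN = re.compile(r"\b\d{4}\b")

def years_from_date_entities(entities):
    """Apply a regex only to the entities that NER labeled as dates."""
    years = []
    for text, label in entities:
        if label == "Date":
            years.extend(YEAR_PATTERN.findall(text))
    return years

# Stand-in for NER output: (entity text, entity type) pairs
entities = [
    ("Barack Obama", "Person"),
    ("Hawaii", "Location"),
    ("August 4, 1961", "Date"),
]

print(years_from_date_entities(entities))  # ['1961']
```

NER decides which spans are dates, and the regex then extracts a fixed-format piece from inside those spans.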


**Part A: Named Entity Recognition (NER)**

Load any **pre-trained SpaCy model** and **perform NER** on the following text:

"Elon Musk founded SpaceX in 2002 and later acquired Twitter, now known as X, in 2022."

Extract **all named entities** along with their entity types.
Display the **entities in a tabular format (Entity, Entity Type).**

**1. Load the SpaCy model: Loading SpaCy's small English model**

**2. Apply NER to the text: We apply the SpaCy NER pipeline to the given sentence.**

In [1]:
import pandas as pd
import spacy

# Load SpaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Define the input text
text = "Elon Musk founded SpaceX in 2002 and later acquired Twitter, now known as X, in 2022."

# Process the text using the NER pipeline
doc = nlp(text)

# Extract named entities and their types
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Display entities in a table format
df = pd.DataFrame(entities, columns=["Entity", "Entity Type"])
print(df)


      Entity Entity Type
0  Elon Musk      PERSON
1       2002        DATE
2    Twitter     PRODUCT
3       2022        DATE


**Explanation:**

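* Elon Musk: Identified as a PERSON.
* 2002 and 2022: Recognized as DATE entities.
* Twitter: Labeled PRODUCT in this run rather than ORG (organization).
* SpaceX and X: Not returned as entities in the output above, although both are organization names. The small model can miss or mislabel names like these.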

**2. Write a Python function that takes any text as input and highlights the following entity types:**

**Person**

**Organization**

**Date**

In [2]:
import spacy

# Load the SpaCy pre-trained model
nlp = spacy.load("en_core_web_sm")

def highlight_entities(text):
    # Process the text with the SpaCy NER pipeline
    doc = nlp(text)

    # Start from the original text and insert markers around each entity
    highlighted_text = text

    # Walk the entities from last to first so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            marked = f"**{ent.text}**"
        elif ent.label_ in ("ORG", "DATE"):
            marked = f"[[{ent.text}]]"
        else:
            continue
        # Replace by position so the same string elsewhere in the text is untouched
        highlighted_text = (
            highlighted_text[:ent.start_char] + marked + highlighted_text[ent.end_char:]
        )

    return highlighted_text

# Example input text
text = "Elon Musk founded SpaceX in 2002 and later acquired Twitter, now known as X, in 2022."

# Call the function
highlighted = highlight_entities(text)

# Display the highlighted text
print(highlighted)




**Elon Musk** founded SpaceX in [[2002]] and later acquired Twitter, now known as X, in [[2022]].
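One detail worth checking in any highlighter is repeated entity text. The sketch below compares string-based replacement with position-based insertion. It uses hand-made character spans as a stand-in for the `start_char` and `end_char` attributes that SpaCy entities expose; both helper names are hypothetical.

```python
import re

def highlight_by_replace(text, entity_texts):
    """Wrap each entity by string replacement (touches every occurrence)."""
    out = text
    for ent_text in entity_texts:
        out = out.replace(ent_text, f"[[{ent_text}]]")
    return out

def highlight_by_offsets(text, spans):
    """Wrap each (start, end) span, working right to left so offsets stay valid."""
    out = text
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + "[[" + out[start:end] + "]]" + out[end:]
    return out

text = "Launched in 2022, relaunched in 2022."
# Stand-in for NER output: two separate DATE entities with the same text
spans = [m.span() for m in re.finditer(r"2022", text)]
entity_texts = [text[s:e] for s, e in spans]

print(highlight_by_replace(text, entity_texts))
# Launched in [[[[2022]]]], relaunched in [[[[2022]]]].
print(highlight_by_offsets(text, spans))
# Launched in [[2022]], relaunched in [[2022]].
```

String replacement wraps every occurrence once per entity, so repeated text is wrapped twice. Inserting by position wraps each entity exactly once.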


#**Part B: Regular Expressions**

**3. Use Python's re module to extract all email addresses from the following text:
"Please contact us at support@example.com, info@nlp.com, or feedback123@textai.org for further details."**

In [3]:
import re

def extract_emails(text):
    # Regex pattern for an email address: local part, '@', domain, and a TLD of 2+ letters
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

    # Find all matching email addresses using the regex pattern
    emails = re.findall(email_pattern, text)

    return emails

In [4]:
text = "Please contact us at support@example.com, info@nlp.com, or feedback123@textai.org for further details."

# Extract email addresses
emails = extract_emails(text)

# Display the extracted emails
print(emails)

['support@example.com', 'info@nlp.com', 'feedback123@textai.org']
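To see how a pattern of this kind behaves on less tidy input, the sketch below runs it on a sentence with a trailing period, a dotted subdomain, and two strings that only resemble emails. The test sentence and the name `EMAIL_PATTERN` are made up for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

messy_text = (
    "Write to first.last+tag@mail.example.co.in. "
    "Not an email: user@localhost or @example.com"
)

found = EMAIL_PATTERN.findall(messy_text)
print(found)  # ['first.last+tag@mail.example.co.in']
```

The sentence-ending period is left out because the pattern must end in letters. `user@localhost` is skipped because its domain contains no dot, and `@example.com` is skipped because nothing precedes the `@`.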


**4. Write a Python script to perform the following tasks:**

* Replace all digits in the text:
"The meeting is scheduled for 10:30 AM on 25th January 2025."
Replace digits with the string '*'.

* Find and extract all dates from the text:
"Important dates are 25-01-2025, 26/01/2025, and 27-01-2025."

In [5]:
import re

def replace_digits(text):
    # Replace all digits with '*'
    return re.sub(r'\d', '*', text)

def extract_dates(text):
    # Regex pattern to match dates in the formats dd-mm-yyyy or dd/mm/yyyy
    date_pattern = r'\b(?:\d{2}[-/]\d{2}[-/]\d{4})\b'
    return re.findall(date_pattern, text)

# Task 1: Replace all digits with '*'
text1 = "The meeting is scheduled for 10:30 AM on 25th January 2025."
replaced_text = replace_digits(text1)
print("Replaced Text: ", replaced_text)

# Task 2: Extract all dates from the text
text2 = "Important dates are 25-01-2025, 26/01/2025, and 27-01-2025."
dates = extract_dates(text2)
print("Extracted Dates: ", dates)


Replaced Text:  The meeting is scheduled for **:** AM on **th January ****.
Extracted Dates:  ['25-01-2025', '26/01/2025', '27-01-2025']


Task 1: Every digit in 10:30, 25th, and 2025 is replaced with *, while the colon, the letters, and the spaces are left unchanged.

Task 2: The dates 25-01-2025, 26/01/2025, and 27-01-2025 are correctly extracted from the text.
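A pattern that allows either separator in both positions also accepts a mixed form such as 27-01/2025. If that is not wanted, a backreference can require the second separator to equal the first. This is a sketch of a stricter variant, not part of the assignment; `LOOSE_DATE` and `STRICT_DATE` are illustrative names.

```python
import re

# Either separator is allowed in each position independently
LOOSE_DATE = re.compile(r"\b\d{2}[-/]\d{2}[-/]\d{4}\b")

# Group 1 captures the first separator; \1 requires the second one to be the same
STRICT_DATE = re.compile(r"\b\d{2}([-/])\d{2}\1\d{4}\b")

mixed_text = "Important dates are 25-01-2025, 26/01/2025, and 27-01/2025."

loose = LOOSE_DATE.findall(mixed_text)
# finditer with group(0) returns the full match rather than only the captured separator
strict = [m.group(0) for m in STRICT_DATE.finditer(mixed_text)]

print(loose)   # ['25-01-2025', '26/01/2025', '27-01/2025']
print(strict)  # ['25-01-2025', '26/01/2025']
```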

**5. Implement a function using regular expressions to check whether a given string is a valid Indian phone number (10 digits, starts with 7, 8, or 9). Test your function with various inputs.**

**Criteria:**

1. The number should be exactly 10 digits long.
2. The first digit should be either 7, 8, or 9.
3. The remaining 9 digits can be any digit from 0 to 9.

**Regular Expression:**


^[789]\d{9}$

**Explanation:**

* ^: ensures the match starts at the beginning of the string.

* [789]: matches the first digit, which must be 7, 8, or 9.

* \d{9}: matches exactly 9 more digits.

* $: ensures the match ends at the end of the string.

In [6]:
import re

def is_valid_indian_phone_number(phone_number: str) -> bool:
    # Define the regex pattern for a valid Indian phone number
    pattern = r'^[789]\d{9}$'

    # re.fullmatch requires the whole string to match, so a trailing newline is rejected too
    return re.fullmatch(pattern, phone_number) is not None

# Test cases
test_numbers = [
    "9876543210",  # valid number
    "9123456789",  # valid number
    "8234567890",  # valid number
    "1234567890",  # invalid (doesn't start with 7, 8, or 9)
    "98765",       # invalid (less than 10 digits)
    "98765432101", # invalid (more than 10 digits)
    "abc1234567",  # invalid (contains non-digit characters)
]

# Testing the function
for number in test_numbers:
    result = is_valid_indian_phone_number(number)
    print(f"{number}: {'Valid' if result else 'Invalid'}")


9876543210: Valid
9123456789: Valid
8234567890: Valid
1234567890: Invalid
98765: Invalid
98765432101: Invalid
abc1234567: Invalid
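One subtlety with `$` in Python: it also matches just before a newline at the very end of the string. The sketch below compares `re.match` with `re.fullmatch` on an input that ends in a newline, as can happen when a number is read from a file or from user input.

```python
import re

PATTERN = r'^[789]\d{9}$'

with_newline = "9876543210\n"

# $ is satisfied just before the trailing newline, so re.match accepts the input
accepted_by_match = re.match(PATTERN, with_newline) is not None

# re.fullmatch must consume the entire string, newline included, so it rejects the input
accepted_by_fullmatch = re.fullmatch(PATTERN, with_newline) is not None

print(accepted_by_match)      # True
print(accepted_by_fullmatch)  # False
```

Stripping whitespace before validation, or using `re.fullmatch`, avoids accepting such inputs.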
