# <center>Assignment 1 - Natural Language Processing</center>
#### **Name:** Fatima Azfar
#### **Roll no:** 20L-1027
#### **Section:** BDS-8A

# <center>Part 1: Regular Expressions and Preprocessing</center>

### Question 1
Describe the class of strings matched by the following regular expressions.
- [a-zA-Z]+
- [A-Z][a-z]*
- p[aeiou]{,2}t
- \d+(\.\d+)?
- ([^aeiou][aeiou][^aeiou])*
- \w+|[^\w\s]+

Test your answers using nltk.re_show(). (You will have to import libraries using “import nltk, re, pprint”)

In [15]:
import nltk, re, pprint

In [16]:
# Raw strings keep backslashes literal, so \d, \w and \s reach the regex engine unchanged
regex_patterns = {
    r"[a-zA-Z]+": "One or more letters (uppercase or lowercase).",
    r"[A-Z][a-z]*": "An uppercase letter followed by zero or more lowercase letters.",
    r"p[aeiou]{,2}t": "A 'p' followed by at most two vowels and ending with 't'.",
    r"\d+(\.\d+)?": "An integer or a decimal number (with optional fractional part).",
    r"([^aeiou][aeiou][^aeiou])*": "Zero or more occurrences of a non-vowel, a vowel, and a non-vowel (a non-vowel here is any character other than a lowercase vowel).",
    r"\w+|[^\w\s]+": "One or more word characters or one or more characters that are neither word characters nor whitespace."
}

test_strings = {
    r"[a-zA-Z]+": "Hello World!",
    r"[A-Z][a-z]*": "Hello world",
    r"p[aeiou]{,2}t": "pat, pet, peat, pt",
    r"\d+(\.\d+)?": "123, 4.56, .78",
    r"([^aeiou][aeiou][^aeiou])*": "bcdfghjklmnpqrstvwxyz",
    r"\w+|[^\w\s]+": "Hello, World!"
}

def test_regex_patterns():
    for pattern, description in regex_patterns.items():
        print(f"Regular Expression: {pattern} -> {description}")
        print(f"Testing with string: '{test_strings[pattern]}'")
        nltk.re_show(pattern, test_strings[pattern])
        print("-"*80)

test_regex_patterns()

Regular Expression: [a-zA-Z]+ -> One or more letters (uppercase or lowercase).
Testing with string: 'Hello World!'
{Hello} {World}!
--------------------------------------------------------------------------------
Regular Expression: [A-Z][a-z]* -> An uppercase letter followed by zero or more lowercase letters.
Testing with string: 'Hello world'
{Hello} world
--------------------------------------------------------------------------------
Regular Expression: p[aeiou]{,2}t -> A 'p' followed by at most two vowels and ending with 't'.
Testing with string: 'pat, pet, peat, pt'
{pat}, {pet}, {peat}, {pt}
--------------------------------------------------------------------------------
Regular Expression: \d+(\.\d+)? -> An integer or a decimal number (with optional fractional part).
Testing with string: '123, 4.56, .78'
{123}, {4.56}, .{78}
--------------------------------------------------------------------------------
Regular Expression: ([^aeiou][aeiou][^aeiou])* -> Zero or more occurrences
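
Because nltk.re_show() highlights every match inside a longer string, it does not show directly whether a pattern accepts a string as a whole. The sketch below is a supplementary check that uses only the standard re module; the sample strings (such as "pout" and "batman") are illustrative choices, not part of the assignment.

```python
import re

# (pattern, candidate string, whether the WHOLE string should match)
checks = [
    (r"[a-zA-Z]+", "Hello", True),
    (r"[a-zA-Z]+", "Hello!", False),                  # '!' is not a letter
    (r"[A-Z][a-z]*", "A", True),                      # zero lowercase letters is allowed
    (r"[A-Z][a-z]*", "hello", False),                 # must start with an uppercase letter
    (r"p[aeiou]{,2}t", "pout", True),                 # two vowels between 'p' and 't'
    (r"p[aeiou]{,2}t", "peeet", False),               # three vowels is one too many
    (r"\d+(\.\d+)?", "4.56", True),
    (r"\d+(\.\d+)?", ".78", False),                   # digits are required before the dot
    (r"([^aeiou][aeiou][^aeiou])*", "", True),        # zero repetitions: the empty string
    (r"([^aeiou][aeiou][^aeiou])*", "batman", True),  # 'bat' + 'man'
    (r"([^aeiou][aeiou][^aeiou])*", "bats", False),   # length is not a multiple of 3
]

# re.fullmatch succeeds only if the pattern consumes the entire string
results = [(p, s, re.fullmatch(p, s) is not None) for p, s, _ in checks]
for (pattern, string, expected), (_, _, actual) in zip(checks, results):
    print(f"{pattern!r:35} {string!r:10} expected={expected} actual={actual}")

# The last pattern behaves like a simple tokenizer when used with findall
tokens = re.findall(r"\w+|[^\w\s]+", "Hello, World!")
print(tokens)
```

Used with findall, the last pattern splits "Hello, World!" into words and punctuation runs, which is why it is often described as a basic tokenizer.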

### Question 2
Write regular expressions to match the following classes of strings:

- A single determiner (assume that a, an, and the are the only determiners).
- An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

In [46]:
regex_patterns = {
    r"\b(a|an|the)\b": "A single determiner.",
    r"\b\d+\s*[\+\*]\s*\d+\s*[\+\*]\s*\d+\b": "An arithmetic expression using integers, addition, and multiplication, such as 2*3+8."
}

test_strings = {
    r"\b(a|an|the)\b": "This is an apple. The apple is a fruit.",
    r"\b\d+\s*[\+\*]\s*\d+\s*[\+\*]\s*\d+\b": "Here are two expressions: 2*3+8 and 3+5*9",
}

def test_regex_patterns():
    for pattern, description in regex_patterns.items():
        print(f"Regular Expression: {pattern} -> {description}")
        print(f"Testing with string: '{test_strings[pattern]}'")
        print("Matches found:", re.findall(pattern, test_strings[pattern], re.IGNORECASE))
        print("-"*80)

test_regex_patterns()

Regular Expression: \b(a|an|the)\b -> A single determiner.
Testing with string: 'This is an apple. The apple is a fruit.'
Matches found: ['an', 'The', 'a']
--------------------------------------------------------------------------------
Regular Expression: \b\d+\s*[\+\*]\s*\d+\s*[\+\*]\s*\d+\b -> An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.
Testing with string: 'Here are two expressions: 2*3+8 and 3+5*9'
Matches found: ['2*3+8', '3+5*9']
--------------------------------------------------------------------------------
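
The arithmetic pattern above is written for exactly three operands. One way to accept any number of operands is to repeat the "operator plus integer" part as a non-capturing group. The pattern and sample text below are an illustrative sketch under that assumption, not the submitted answer.

```python
import re

# An integer followed by one or more "<operator> <integer>" pairs.
# (?:...) is non-capturing, so findall returns the whole expression.
arithmetic = re.compile(r"\b\d+(?:\s*[+*]\s*\d+)+\b")

# A single determiner, to be checked against a whole candidate string
determiner = re.compile(r"a|an|the", re.IGNORECASE)

sample = "Try 2*3+8, 3 + 5, and 1+2*3+4*5, but not a lone 7."
expressions = arithmetic.findall(sample)
print(expressions)

# fullmatch rejects words that merely start with a determiner
for word in ["The", "an", "then"]:
    print(word, bool(determiner.fullmatch(word)))
```

Requiring at least one operator keeps a bare integer such as 7 from counting as an expression; whether that is desirable depends on how the question is read.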


# <center>Data Scraping</center>

### Question 3
Write a utility function that takes a URL as its argument and returns the contents of the URL with all HTML markup removed. Use from urllib import request and then request.urlopen('https://www.csail.mit.edu/people/').read().decode('utf8') to access the contents of the URL. Use BeautifulSoup(html).get_text() to parse the HTML.

Import the following for this question:
- from urllib import request
- from bs4 import BeautifulSoup

In [24]:
from urllib import request
from bs4 import BeautifulSoup
import nltk

In [37]:
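def get_text_from_url(url):
    """Fetch a URL and return its text content with all HTML markup removed."""
    try:
        # Download the page and decode the raw bytes as UTF-8
        html = request.urlopen(url).read().decode('utf8')
    except Exception as e:
        # Report the failure and return an empty string, so that callers
        # never mistake an error message for page text
        print(f"An error occurred: {e}")
        return ""
    # Parse the HTML and keep only the text nodes
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()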
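
get_text() keeps the whitespace that sat between tags, so scraped pages often come back with long runs of blank lines. A minimal post-processing sketch follows; collapse_blank_lines is a hypothetical helper name, and the sample string only imitates the shape of scraped text.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the empty ones (hypothetical helper)."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Imitation of get_text() output: indented lines separated by blank runs
raw = "\n\n\nPeople | MIT CSAIL\n\n\n    Skip to main content\n  \n\n\nFor Students\n\n\nFor Industry\n"
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

Beautiful Soup can also do part of this itself: get_text() accepts a separator string and a strip flag, for example soup.get_text(separator="\n", strip=True).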

In [38]:
url = 'https://www.csail.mit.edu/people/'
text = get_text_from_url(url)
print(text)

People | MIT CSAIL

Skip to main content

For Students
For Industry
For Members
Accessibility
Login

MIT CSAIL

Research
People
News
Events
Symposia
About

MIT LOGO
Created with Sketch.

Research
People
News
Events
Symposia
About

For Students
For Industry
For Members
Accessibility
Login

Contact
Press Requests
Accessibility

Search

MIT CSAIL

Massachusetts Institute of Technology
Computer Science & Artificial Intelligence Laboratory
32 Vassar St, Cambridge MA 02139

Contact
Press Requests
Accessibility

# <center>Tokenization</center>

### Question 4
Tokenize text parsed from the above URL using nltk. Find all phone numbers and email addresses in this text using regular expressions. (Do not tokenize the text before matching; otherwise email addresses will be split incorrectly.)

In [34]:
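def find_contact_info(url):
    """Return (phone_numbers, email_addresses) found in the text of the page at url."""
    text = get_text_from_url(url)

    # Optional country code, then a 3-3-4 digit number. The group is
    # non-capturing so that re.findall returns the whole number.
    phone_regex = r'(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}'
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class '|' is a literal
    email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

    phone_numbers = re.findall(phone_regex, text)
    email_addresses = re.findall(email_regex, text)

    return phone_numbers, email_addresses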

In [36]:
# MIT Website
url = 'https://www.csail.mit.edu/people/'
phone_numbers, email_addresses = find_contact_info(url)
print("-------MIT-------")
print("Phone Numbers:", phone_numbers)
print("Email Addresses:", email_addresses)

# FAST NU Website
url = 'https://lhr.nu.edu.pk/faculty/'
phone_numbers, email_addresses = find_contact_info(url)
print("-------FAST NUCES-------")
print("Phone Numbers:", phone_numbers)
print("Email Addresses:", email_addresses)

-------MIT-------
Phone Numbers: []
Email Addresses: []
-------FAST NUCES-------
Phone Numbers: []
Email Addresses: ['admissions.lhr@nu.edu.pk', 'kashif.zafar@nu.edu.pk', 'aamir.wali@nu.edu.pk', 'asif.gilani@nu.edu.pk', 'hammad.naveed@nu.edu.pk', 'zareen.alamgir@nu.edu.pk', 'arshad.ali1@nu.edu.pk', 'asma.naseer@nu.edu.pk', 'irfan.younas@nu.edu.pk', 'r.asif@nu.edu.pk', 'saira.karim@nu.edu.pk', 'zeeshanali.khan@nu.edu.pk', 'aatira.anum@nu.edu.pk', 'ali.afzal@nu.edu.pk', 'ammar.haider@nu.edu.pk', 'asma.ahmad@nu.edu.pk', 'faisal.aslam@nu.edu.pk', 'farooq.ahmad@nu.edu.pk', 'hajra.waheed@nu.edu.pk', 'haroon.mahmood@nu.edu.pk', 'iqra.safder@nu.edu.pk', 'maryam.bashir@nu.edu.pk', 'mubasher.baig@nu.edu.pk', 'muhammad.ahmadraza@nu.edu.pk', 'm.Irteza@nu.edu.pk', 'tahir.ejaz@nu.edu.pk', 'zeeshan.rana@nu.edu.pk', 'aamir.raheem@nu.edu.pk', 'abeeda.akram@nu.edu.pk', 'ishaq.raza@nu.edu.pk', 'lehmia.kiran@nu.edu.pk', 'noshaba.nasir@nu.edu.pk', 'ali.omer@nu.edu.pk', 'samin.iftikhar@nu.edu.pk', 'sobia.ta

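No email addresses were extracted from MIT's page. The most likely reason is that the people listings are loaded dynamically after the initial page load, so they are not part of the static HTML that urlopen retrieves, as the parsed text shown under Question 3 suggests.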
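
Neither page produced a phone match, so the phone pattern is never exercised by the runs above. The sketch below tries the patterns on made-up sample text (the numbers and the address are hypothetical) and also shows a detail of re.findall worth knowing: when a pattern contains a capturing group, findall returns the group's contents instead of the full match.

```python
import re

# Hypothetical sample text; the numbers and the address are made up
sample = "Call +1 (617) 555-0100 or 617.555.0199, or write to jane.doe@example.edu."

# Capturing group: findall returns only what the group captured
capturing = r'(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}'
# Non-capturing group (?:...): findall returns each full match
non_capturing = r'(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}'
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

group_only = re.findall(capturing, sample)
full_numbers = re.findall(non_capturing, sample)
emails = re.findall(email_regex, sample)

print("Capturing group:    ", group_only)
print("Non-capturing group:", full_numbers)
print("Emails:             ", emails)
```

With the capturing version, a page that did contain phone numbers would yield only country-code fragments and empty strings, which is easy to misread as "no numbers found".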

# <center>Stemming</center>

### Question 5
Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

In [39]:
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

In [40]:
tokens = word_tokenize(text)

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Porter Stemmer
porter_stemmed = [porter.stem(token) for token in tokens]
print("Porter Stemmer:", porter_stemmed)

# Lancaster Stemmer
lancaster_stemmed = [lancaster.stem(token) for token in tokens]
print("Lancaster Stemmer:", lancaster_stemmed)

Porter Stemmer: ['peopl', '|', 'mit', 'csail', 'skip', 'to', 'main', 'content', 'for', 'student', 'for', 'industri', 'for', 'member', 'access', 'login', 'mit', 'csail', 'research', 'peopl', 'new', 'event', 'symposia', 'about', 'mit', 'logo', 'creat', 'with', 'sketch', '.', 'research', 'peopl', 'new', 'event', 'symposia', 'about', 'for', 'student', 'for', 'industri', 'for', 'member', 'access', 'login', 'contact', 'press', 'request', 'access', 'search', 'mit', 'csail', 'massachusett', 'institut', 'of', 'technolog', 'comput', 'scienc', '&', 'artifici', 'intellig', 'laboratori', '32', 'vassar', 'st', ',', 'cambridg', 'ma', '02139', 'contact', 'press', 'request', 'access']
Lancaster Stemmer: ['peopl', '|', 'mit', 'csail', 'skip', 'to', 'main', 'cont', 'for', 'stud', 'for', 'industry', 'for', 'memb', 'access', 'login', 'mit', 'csail', 'research', 'peopl', 'new', 'ev', 'sympos', 'about', 'mit', 'logo', 'cre', 'with', 'sketch', '.', 'research', 'peopl', 'new', 'ev', 'sympos', 'about', 'for', '
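
One way to surface the differences is to pair the two outputs token by token and keep only the positions where the stems disagree. The lists below are the first 27 stems of each output as printed above, copied by hand; this is a sketch for reading the results, not a re-run of the stemmers.

```python
# First 27 stems of each output, copied from the printed results
porter_sample = ['peopl', '|', 'mit', 'csail', 'skip', 'to', 'main', 'content',
                 'for', 'student', 'for', 'industri', 'for', 'member', 'access',
                 'login', 'mit', 'csail', 'research', 'peopl', 'new', 'event',
                 'symposia', 'about', 'mit', 'logo', 'creat']
lancaster_sample = ['peopl', '|', 'mit', 'csail', 'skip', 'to', 'main', 'cont',
                    'for', 'stud', 'for', 'industry', 'for', 'memb', 'access',
                    'login', 'mit', 'csail', 'research', 'peopl', 'new', 'ev',
                    'sympos', 'about', 'mit', 'logo', 'cre']

# Keep each (Porter, Lancaster) pair that differs, once, in order of appearance
differing = list(dict.fromkeys(
    (por, lan) for por, lan in zip(porter_sample, lancaster_sample) if por != lan
))
for por, lan in differing:
    print(f"{por:10} {lan}")

# Count the pairs in which Lancaster produced the shorter stem
shorter = sum(len(lan) < len(por) for por, lan in differing)
print(f"Lancaster is shorter in {shorter} of {len(differing)} differing pairs")
```

In this sample the Lancaster stemmer cuts deeper in most of the differing pairs (for example "event" becomes "ev" rather than staying "event"), which is consistent with its reputation as the more aggressive of the two.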

### Question 6
For this question, assume you have a shy friend who is hesitating to tell you something, so he/she sent a long random text on WhatsApp that also contains his/her message. Since you are a Regex Guru, your task is to extract the actual message from the random text using regular expressions and some rules.

In [42]:
message = "Pila Forfeited you engrossed but 1kometimes explained. Another 1kacokaco1 as studied it to evident. Merry sense 9given he be arisepila. Conduct at an replied removal an amongst. Remainingzalima 0determine few her two cordially Zalima admitting old. Sometimes ctra*nger his pisdsdla ourselves her co*la depending you boy. Eat discretion cultivated possession far comparison projection pila considered. And few fat interested discovered inquietude insensible unsatiable increasing zalima eat."

In [58]:
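patterns = {
    # Whole word starting with 'Z' or 'z' and ending in 'a' (e.g. Zalima)
    'first_word': r'\b[Zz][a-z]*a\b',
    # Word wrapped in digits, with 'k' right after the leading digit (e.g. 1kacokaco1)
    'second_word': r'\b\d[k][a-z]*\d\b',
    # Word starting with 'c', containing '*', and ending in 'a' (e.g. co*la)
    'third_word': r'\bc[a-z]*\*[a-z]+a\b',
    # Four-letter word starting with 'P' or 'p' and ending in 'a' (e.g. Pila)
    'fourth_word': r'\b[Pp][a-z]{2}a\b'
}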

results = {}

for key, pattern in patterns.items():
    matches = re.findall(pattern, message)
    results[key] = matches

# First word
first_word = results['first_word'][0] if results['first_word'] else None
first_word_count = len(results['first_word'])

# Second word
second_word_raw = results['second_word'][0] if results['second_word'] else None
second_word = second_word_raw[3:-3] if second_word_raw else None

# Third word
third_word_raw = results['third_word'][0] if results['third_word'] else None
third_word = third_word_raw.replace('*', '') if third_word_raw else None

# Fourth word
fourth_word = results['fourth_word'][0] if results['fourth_word'] else None
fourth_word_count = len(results['fourth_word'])

# Fifth word
fifth_word = "de"

print(f"First Word: {first_word}, Frequency: {first_word_count}")
print(f"Second Word: {second_word}")
print(f"Third Word: {third_word}")
print(f"Fourth Word: {fourth_word}, Frequency: {fourth_word_count}")
print(f"Fifth Word: {fifth_word}")
print("The hidden sentence is: ",first_word +" "+ second_word +" "+ third_word +" "+ fourth_word +" "+ fifth_word)


First Word: Zalima, Frequency: 2
Second Word: coka
Third Word: cola
Fourth Word: Pila, Frequency: 2
Fifth Word: de
The hidden sentence is:  Zalima coka cola Pila de
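
The fifth word is hard-coded above. If the intended rule is that a word prefixed with the digit 0 contributes its first two letters (an assumption; the rule is not stated in this notebook), it could be extracted with a capturing group instead. The excerpt below is one sentence taken from message.

```python
import re

# One sentence from the message, enough to show the idea
excerpt = "Remainingzalima 0determine few her two cordially Zalima admitting old."

# ASSUMED rule: a word that starts with the digit 0 hides a two-letter word
# in its first two letters. The group captures just those two letters.
fifth_matches = re.findall(r'\b0([a-z]{2})[a-z]*\b', excerpt)
fifth_word = fifth_matches[0] if fifth_matches else None
print(f"Fifth Word: {fifth_word}")
```

If the actual rule differs, only the pattern needs to change; the surrounding lookup mirrors how the other four words are handled.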
