<a href="https://colab.research.google.com/github/azoqi/Natural-Language-Processing/blob/main/ClassCoding1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem: We would like to search Wikipedia for any mention of St. Edward's University, including variants such as St. Ed, St. Ed's, St. Edwards, St. Edward's, and St. Edwards'.

Requirements: Write a Python program that uses the Beautiful Soup module to parse Wikipedia, which can be found at https://en.wikipedia.org/wiki/Main_Page. Devise a regular expression to match the St. Ed's pattern described in the problem statement. Your program should print the title and URL of every article that matches the pattern.

You should use a time.sleep function call in your program to avoid making excessive requests to Wikipedia.
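
One way to meet the `time.sleep` requirement is to wrap the delay in a small helper so that every request path goes through it. The sketch below is only an illustration: the class name `Throttle` and the parameter `min_interval` are hypothetical and not part of the assignment.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between successive wait() calls."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage: call wait() immediately before each HTTP request.
throttle = Throttle(min_interval=0.05)  # short interval just for the demo
for _ in range(3):
    throttle.wait()
    # requests.get(...) would go here
```

The first call returns immediately; later calls sleep only as long as needed, so time already spent parsing a page counts toward the delay.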

In [None]:
import requests
import re
import time
from bs4 import BeautifulSoup

In [None]:
base_url = "https://en.wikipedia.org"
main_url = base_url + "/wiki/Main_Page"

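# Request the main page; the timeout avoids hanging on a stalled connection
response = requests.get(main_url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.text, 'html.parser')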

# Find all links that are likely article links
article_links = set()
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("/wiki/") and ":" not in href:
        article_links.add(href)
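
The link filter can also be written as a pure function, which makes it easy to check without touching the network. This is a sketch under assumptions: the helper name `article_path` is hypothetical, and dropping `#fragment` parts and the main page itself goes beyond what the cell above does. Like the original filter, it skips any title containing a colon, which also excludes genuine articles whose titles contain one.

```python
from urllib.parse import urldefrag


def article_path(href):
    """Return a normalised /wiki/ path for likely article links, else None."""
    path, _fragment = urldefrag(href)  # "/wiki/Foo#Bar" -> "/wiki/Foo"
    if not path.startswith("/wiki/"):
        return None
    title = path[len("/wiki/"):]
    # Skip empty titles, namespaced pages (Special:, File:, ...) and the main page
    if not title or ":" in title or title == "Main_Page":
        return None
    return path


# Made-up hrefs of the kinds that appear on a Wikipedia page
hrefs = [
    "/wiki/Austin,_Texas",
    "/wiki/Austin,_Texas#History",
    "/wiki/Special:Random",
    "/w/index.php?title=Foo",
    "/wiki/Main_Page",
    "//de.wikipedia.org/wiki/Foo",
]
links = {p for p in map(article_path, hrefs) if p}
print(sorted(links))
```

Collapsing fragment links into their base path keeps the set free of duplicates that would otherwise trigger repeated requests for the same article.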

In [None]:
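# Regular expression to match variants of "St. Edward's", including:
#    "St. Ed", "St. Ed's", "St. Edwards", "St. Edward's", and "St. Edwards'"
# \b and (?!\w) restrict the match to whole words, so "St. Edmund" and the
# "st ed" inside "first edition" or "1st ed." are not matched.
pattern = re.compile(r"\bSt\.?\s+Ed(?:ward)?(?:s['’]|['’]s|s)?(?!\w)", re.IGNORECASE)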

print("Searching for articles mentioning St. Edward's University...\n")

# Loop over each article link
for href in article_links:
    article_url = base_url + href
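    try:
        article_resp = requests.get(article_url, timeout=10)
        article_resp.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {article_url}: {e}")
        continue
    finally:
        # Pause after every request, including failed ones
        time.sleep(1)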

    article_soup = BeautifulSoup(article_resp.text, 'html.parser')
    article_text = article_soup.get_text()

    # Search for our pattern in the article text
    if pattern.search(article_text):
        heading = article_soup.find("h1", id="firstHeading")
        title = heading.text.strip() if heading else "No Title"
        print(f"Title: {title}\nURL: {article_url}\n")


Searching for articles mentioning St. Edward's University...

Title: February 10
URL: https://en.wikipedia.org/wiki/February_10

Title: Helene Kröller-Müller
URL: https://en.wikipedia.org/wiki/Helene_Kr%C3%B6ller-M%C3%BCller

Title: Humphrey III of Toron
URL: https://en.wikipedia.org/wiki/Humphrey_III_of_Toron

Title: 2025 Southeast Europe retail boycotts
URL: https://en.wikipedia.org/wiki/2025_Southeast_Europe_retail_boycotts

Title: Maria Einsmann
URL: https://en.wikipedia.org/wiki/Maria_Einsmann

Title: Hottingen (Zurich)
URL: https://en.wikipedia.org/wiki/Hottingen_(Zurich)

Title: Robbery
URL: https://en.wikipedia.org/wiki/Robbery

Title: American football
URL: https://en.wikipedia.org/wiki/American_football



KeyboardInterrupt: 
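
Several titles in a listing like the one above look unrelated to the university. One plausible cause (an assumption, not checked against those pages) is that a case-insensitive pattern without word boundaries also matches inside phrases such as "first edition" or "1st ed.", and matches "St. Ed" as a prefix of names like "St. Edmund". The sketch below compares such an unanchored pattern (`loose`) with an anchored variant (`strict`); both names and all sample sentences are made up for illustration.

```python
import re

# Unanchored: "St", optional ".", whitespace, "Ed", optional suffix
loose = re.compile(r"St\.?\s+Ed(?:ward(?:'s|s|s')|'?s)?", re.IGNORECASE)
# Anchored variant (illustrative): must start and end on a word boundary
strict = re.compile(r"\bSt\.?\s+Ed(?:ward)?(?:s'|'s|s)?(?!\w)", re.IGNORECASE)

samples = [
    "She graduated from St. Edward's University in Austin.",
    "The St. Ed's campus sits on a hill.",
    "Cited from the first edition of the handbook.",
    "Oxford University Press, 1st ed., 1998.",
    "The church of St. Edmund was rebuilt.",
]

for text in samples:
    # Print whether each pattern finds a match in the sample sentence
    print(bool(loose.search(text)), bool(strict.search(text)), text)
```

Here `loose` matches all five sentences, while `strict` matches only the first two. Checking a candidate pattern against a handful of negative examples like these is a cheap way to catch over-matching before crawling.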