# 1. Retrieve the Following Web Page
Use the requests library to fetch the content of https://www.visitmaryland.org/      
What is the HTTP status code?

In [2]:
%pip install requests beautifulsoup4




In [1]:
import requests

url = "https://www.visitmaryland.org/"
resp = requests.get(url, timeout=30)

print("HTTP Status Code:", resp.status_code)

HTTP Status Code: 403


The requests library sends an HTTP GET request to the site, and the status code in the response indicates whether the request succeeded. Here the response is 403 Forbidden: the request was received, but access to the page content was refused. The likely cause is bot protection that blocks automated clients that do not behave like a real browser; the page text extracted in the next step supports this.
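
As a rough illustration of how a status code maps to an outcome, the sketch below labels a code by its class. `describe_status` is a hypothetical helper, not part of requests; it relies only on the standard library's `http.HTTPStatus`, so it runs without a network connection.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code the standard library knows about
        phrase = "Unknown"

    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational or nonstandard"
    return f"{code} {phrase} ({outcome})"


print(describe_status(403))  # 403 Forbidden (client error)
print(describe_status(200))  # 200 OK (success)
```

In the notebook, the same helper could be called as `describe_status(resp.status_code)`.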

# 2. Extract the Main Page Text
Using BeautifulSoup, extract and print all the visible text from the page (ignore script and style tags).

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")

# Remove script and style elements
for tag in soup(["script", "style"]):
    tag.decompose()

# Extract visible text
text = soup.get_text(separator="\n")
lines = [line.strip() for line in text.splitlines() if line.strip()]

print("\n".join(lines[:40]))


Just a moment...
Enable JavaScript and cookies to continue


The extracted text is a Cloudflare bot-protection challenge ("Just a moment...") rather than the site's actual content, which is consistent with the 403 status code seen earlier. The requests library does not execute JavaScript, so it cannot complete the challenge that a browser would pass before receiving the page. As a result, the server returns the challenge page instead of the real content.
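
One way to catch this case in code is to test the response for the challenge page's tell-tale text before parsing it. The sketch below is a heuristic only: `looks_like_challenge` is a hypothetical helper, the marker strings are taken from the output above, and the status codes 403 and 503 are an assumption about how such challenge pages are typically served.

```python
# Marker strings copied from the challenge page text shown above (lowercased)
CHALLENGE_MARKERS = (
    "just a moment",
    "enable javascript and cookies to continue",
)


def looks_like_challenge(status_code: int, html: str) -> bool:
    """Heuristically detect a bot-protection challenge page."""
    body = html.lower()
    has_marker = any(marker in body for marker in CHALLENGE_MARKERS)
    # Assumption: challenge pages arrive with a 403 or 503 status
    return status_code in (403, 503) and has_marker


sample = (
    "<html><title>Just a moment...</title>"
    "<body>Enable JavaScript and cookies to continue</body></html>"
)
print(looks_like_challenge(403, sample))  # True
print(looks_like_challenge(200, "<html><body>Welcome</body></html>"))  # False
```

In the notebook, `looks_like_challenge(resp.status_code, resp.text)` could guard the text-extraction step.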

In [8]:
import requests
from bs4 import BeautifulSoup

wiki_url = "https://en.wikipedia.org/wiki/Natural_language_processing"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

wiki_resp = requests.get(wiki_url, headers=headers, timeout=30)

print("Status:", wiki_resp.status_code)
print("Final URL:", wiki_resp.url)
print("Length of HTML:", len(wiki_resp.text))
print("First 200 chars:\n", wiki_resp.text[:200])

wiki_soup = BeautifulSoup(wiki_resp.text, "html.parser")

headings = wiki_soup.select("h1, h2, h3")
links = wiki_soup.select("a[href]")

print("Headings found:", len(headings))
print("Links found:", len(links))

Status: 200
Final URL: https://en.wikipedia.org/wiki/Natural_language_processing
Length of HTML: 308626
First 200 chars:
 <!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pin
Headings found: 22
Links found: 969


# 3. Extract Headings from a Wikipedia Page
Fetch the Wikipedia page for Natural Language Processing.
Extract and print all the headings from the page.

In [9]:
headings = wiki_soup.select("h1, h2, h3")

for h in headings:
    print(f"{h.name}: {h.get_text(strip=True)}")

h2: Contents
h1: Natural language processing
h2: History
h3: Symbolic NLP (1950s – early 1990s)
h3: Statistical NLP (1990s–present)
h2: Approaches: Symbolic, statistical, neural networks
h3: Statistical approach
h3: Neural networks
h2: Common NLP tasks
h3: Text and speech processing
h3: Morphological analysis
h3: Syntactic analysis
h3: Lexical semantics (of individual words in context)
h3: Relational semantics (semantics of individual sentences)
h3: Discourse (semantics beyond individual sentences)
h3: Higher-level NLP applications
h2: General tendencies and (possible) future directions
h3: Cognition
h2: See also
h2: References
h2: Further reading
h2: External links
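
Because each heading carries its level in the tag name, the flat list can also be shown as an indented outline. The sketch below assumes the headings have already been reduced to `(tag name, text)` pairs, for example with `[(h.name, h.get_text(strip=True)) for h in headings]`; `format_outline` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
def format_outline(pairs, indent="  "):
    """Indent each heading by its level: h1 gets no indent, h2 one step, h3 two."""
    lines = []
    for name, text in pairs:
        level = int(name[1])  # "h2" -> 2
        lines.append(f"{indent * (level - 1)}{text}")
    return "\n".join(lines)


sample = [
    ("h1", "Natural language processing"),
    ("h2", "History"),
    ("h2", "Common NLP tasks"),
    ("h3", "Text and speech processing"),
    ("h2", "See also"),
]
print(format_outline(sample))
```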


# 4. Extract Links from a Page
From the Wikipedia page above, extract and print all the URLs (href values) of links found on the page.

In [10]:
links = wiki_soup.select("a[href]")

for a in links[:100]:
    print(a["href"])

print("Total links:", len(links))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Special:SpecialPages
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Natural+language+processing
/w/index.php?title=Special:UserLogin&returnto=Natural+language+processing
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Natural+language+processing
/w/index.php?title=Special:UserLogin&returnto=Natural+language+processing
#
#History
#Symbolic_NLP_(1950s_–_early_1990s)
#Statistical_NLP_(1990s–present)
#Approaches:_Symbolic,_sta
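
Many of these href values are relative (`/wiki/Main_Page`), protocol-relative (`//en.wikipedia.org/...`), or in-page fragments (`#History`). A minimal sketch of turning them into absolute URLs with the standard library's `urllib.parse.urljoin` follows. `absolutize` is a hypothetical helper, and skipping fragment-only links is a design choice made for this example, not something the exercise requires.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Natural_language_processing"


def absolutize(hrefs, base=BASE_URL):
    """Resolve hrefs against the page URL, dropping in-page fragment links."""
    absolute = []
    for href in hrefs:
        if href.startswith("#"):
            continue  # fragment-only link: points back into the same page
        absolute.append(urljoin(base, href))
    return absolute


sample = [
    "#History",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
]
for url in absolutize(sample):
    print(url)
# https://en.wikipedia.org/wiki/Main_Page
# https://en.wikipedia.org/wiki/Wikipedia:Contact_us
```

In the notebook, the input could be `(a["href"] for a in links)` and the base could be `wiki_resp.url`.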

# 5. Extract and Save Text
Extract the first paragraph (`<p>`) from the Wikipedia page and save it to a local text file called nlp_intro.txt.

In [11]:
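# Use the first <p> that actually contains text; the first <p> on a page can be empty.
first_p = next((p for p in wiki_soup.find_all("p") if p.get_text(strip=True)), None)
# Plain get_text() keeps the spaces around inline tags such as links; strip only the ends.
first_paragraph_text = first_p.get_text().strip() if first_p else ""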

with open("nlp_intro.txt", "w", encoding="utf-8") as f:
    f.write(first_paragraph_text)

print("Saved first paragraph to nlp_intro.txt")


Saved first paragraph to nlp_intro.txt
