# **Web Scraping Code**
This code downloads the text descriptions of the World Heritage Sites in India from Wikipedia, using Requests and BeautifulSoup to collect the links and the `wikipedia-api` package to fetch the page text. It works in two stages:
- 1) Extract the links to the individual sites from https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_India
- 2) Extract the text content of each linked page (and of the list page itself) into a separate markdown file, stored in a folder called 'sites'
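
Stage 2 names each file after its Wikipedia page title, and a title can contain characters that are awkward in file names (a `/`, for example, is read as a path separator). The following is a minimal sketch of one way to build a safer file name; the helper `safe_filename` and the choice of `_` as the replacement character are illustrative assumptions, not part of the notebook.

```python
import re


def safe_filename(page_title: str, extension: str = ".md") -> str:
    """Turn a page title into a file name (hypothetical helper, for illustration)."""
    # Replace path separators and characters that are invalid on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", page_title)
    # Collapse runs of whitespace into a single underscore.
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    return cleaned + extension


print(safe_filename("Taj_Mahal"))                      # Taj_Mahal.md
print(safe_filename("Khangchendzonga National Park"))  # Khangchendzonga_National_Park.md
print(safe_filename("A/B"))                            # A_B.md
```

The result could be passed as the `file_name` argument wherever a title is written to disk, so that every page ends up as a single file directly inside the target folder.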

In [25]:
%pip install wikipedia-api --quiet


Note: you may need to restart the kernel to use updated packages.


In [20]:
# Part 1: Extract the links to the individual sites from the WHS list page
import os
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

page = 'List_of_World_Heritage_Sites_in_India'
# URL of the Wikipedia list page
wiki_url = f"https://en.wikipedia.org/wiki/{page}"

# Fetch the HTML content of the Wikipedia page and fail early on an HTTP error
response = requests.get(wiki_url, timeout=30)
response.raise_for_status()
html_content = response.text

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Start with the list page itself, then add one title per site
pages = [page]
# Find the first table with class 'wikitable' and collect the links in its row headers
table = soup.find('table', {'class': 'wikitable'})
if table:
    links = table.select('th[scope="row"] a')

    # Collect the page title from each internal (/wiki/...) link
    for link in links:
        href = link.get('href')
        if href and href.startswith('/wiki/'):
            # Strip the '/wiki/' prefix and decode percent-escapes such as %27 (')
            pages.append(unquote(href[len('/wiki/'):]))
else:
    print("Table not found on the Wikipedia page.")
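
The loop above turns each `href` into a page title by dropping the `/wiki/` prefix and decoding escaped characters. A minimal sketch of that step as a standalone function follows, assuming the standard library's `urllib.parse.unquote` for the decoding; the name `href_to_title` and the handling of `#fragment` suffixes are illustrative additions, not part of the notebook.

```python
from typing import Optional
from urllib.parse import unquote

WIKI_PREFIX = "/wiki/"


def href_to_title(href: Optional[str]) -> Optional[str]:
    """Convert an internal Wikipedia href to a page title (hypothetical helper)."""
    # Skip missing hrefs and anything that is not an internal article link.
    if not href or not href.startswith(WIKI_PREFIX):
        return None
    # Drop the prefix and any '#section' fragment.
    path = href[len(WIKI_PREFIX):].split("#", 1)[0]
    # Decode percent-escapes, e.g. %27 becomes an apostrophe.
    return unquote(path) or None


print(href_to_title("/wiki/Humayun%27s_Tomb"))   # Humayun's_Tomb
print(href_to_title("/wiki/Taj_Mahal#History"))  # Taj_Mahal
print(href_to_title("#cite_note-1"))             # None
```

Decoding every percent-escape, rather than only `%27`, also covers titles with accented letters or dashes, which browsers and Wikipedia encode in the same way.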


In [24]:
# Part 2: Extract the text of each site's Wikipedia page and save it as a markdown file

import os
import wikipediaapi

# Create the client once. The first argument is the User-Agent string sent to Wikipedia;
# the language is set separately ('en' for English Wikipedia, change if needed).
wiki_wiki = wikipediaapi.Wikipedia(user_agent='wikipedia_agent_en', language='en')

def get_wikipedia_text(page_title):
    """Return the plain text of a page, or None if the page does not exist."""
    page = wiki_wiki.page(page_title)
    if not page.exists():
        return None
    return page.text

def save_as_markdown(text, folder_path, file_name):
    """Write text to folder_path/file_name, creating the folder if needed."""
    os.makedirs(folder_path, exist_ok=True)
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(text)

for count, page_title in enumerate(pages, start=1):
    result = get_wikipedia_text(page_title)
    if result is not None:
        # One file per page inside 'sites'; a '/' in a title would be read as a
        # path separator, so replace it before building the file name
        save_as_markdown(result, 'sites', page_title.replace('/', '_') + '.md')
        print(f"{count} pages out of {len(pages)} done.")
    else:
        print(f"Page not found: {page_title}")


1 pages out of 43 done.
2 pages out of 43 done.
3 pages out of 43 done.
4 pages out of 43 done.
5 pages out of 43 done.
6 pages out of 43 done.
7 pages out of 43 done.
8 pages out of 43 done.
9 pages out of 43 done.
10 pages out of 43 done.
11 pages out of 43 done.
12 pages out of 43 done.
13 pages out of 43 done.
14 pages out of 43 done.
15 pages out of 43 done.
16 pages out of 43 done.
17 pages out of 43 done.
18 pages out of 43 done.
19 pages out of 43 done.
20 pages out of 43 done.
21 pages out of 43 done.
22 pages out of 43 done.
23 pages out of 43 done.
24 pages out of 43 done.
25 pages out of 43 done.
26 pages out of 43 done.
27 pages out of 43 done.
28 pages out of 43 done.
29 pages out of 43 done.
30 pages out of 43 done.
31 pages out of 43 done.
32 pages out of 43 done.
33 pages out of 43 done.
34 pages out of 43 done.
35 pages out of 43 done.
36 pages out of 43 done.
37 pages out of 43 done.
38 pages out of 43 done.
39 pages out of 43 done.
40 pages out of 43 done.
41 pages 