<a href="https://colab.research.google.com/github/faisu6339-glitch/Webscraping/blob/main/Wikipedia_WebScrap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code imports the `requests` and `BeautifulSoup` libraries, which are commonly used for web scraping. The `requests` library allows Python to send HTTP requests to a webpage and download its content, while `BeautifulSoup` is used to parse the HTML returned from the webpage into a structured format that can be searched and navigated easily. Together, these two tools enable us to access a website, read its HTML, and extract useful information such as text, tags, links, or any specific data contained within the page.


In [1]:
import requests
from bs4 import BeautifulSoup


This code sends a request to the Wikipedia page of Mahatma Gandhi using `requests.get()` and stores the response in the variable `res`. The downloaded HTML is available as `res.text`. This HTML string is passed to `BeautifulSoup` with the `'html.parser'` option, which converts it into a structured, searchable object called `soup`. This allows us to easily navigate and extract specific elements such as headings, paragraphs, tables, or links from the webpage.

In [2]:
res=requests.get('https://en.wikipedia.org/wiki/Mahatma_Gandhi')
soup=BeautifulSoup(res.text,'html.parser')

In [3]:
soup

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also https://phabricator.wikimedia.org/T400119.

Wikipedia rejected the first request and asked for a User-Agent header (see the message above). This code therefore adds a custom HTTP header to the request. We create a dictionary called `headers`, set its `User-Agent` value to a string that mimics a Chrome browser running on Windows, and pass it to `requests.get()` when downloading the Mahatma Gandhi page. Wikimedia's robot policy, linked in the message above, asks automated clients to identify themselves, so a descriptive User-Agent for your own script is a safer choice than a copied browser string. The returned HTML is stored in `res.text` and parsed with `BeautifulSoup` using the `'html.parser'` option, which converts the raw HTML into a structured object called `soup` that makes it easier to locate headings, paragraphs, links, and other elements. If `soup` still shows a page titled "Wikimedia Error", the request was rejected and the later cells will not find the article content.

In [4]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
res=requests.get('https://en.wikipedia.org/wiki/Mahatma_Gandhi', headers=headers)
soup=BeautifulSoup(res.text,'html.parser')
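
The request above can fail without raising an error, so it can help to check the HTTP status explicitly. The following is a minimal sketch, not the notebook's own method: the names `make_session`, `fetch_html`, and the `USER_AGENT` string are hypothetical, and the contact address is a placeholder you would replace with your own.

```python
import requests

# Hypothetical identifier: replace with your own project name and contact address.
USER_AGENT = 'WikipediaScrapeTutorial/0.1 (contact: you@example.com)'

def make_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session

def fetch_html(session, url, timeout=10):
    """Download a page and raise an exception on HTTP errors (4xx/5xx)."""
    res = session.get(url, timeout=timeout)
    res.raise_for_status()
    return res.text

# Usage (needs network access):
# session = make_session()
# html = fetch_html(session, 'https://en.wikipedia.org/wiki/Mahatma_Gandhi')
```

With `raise_for_status()`, a rejected request stops the cell with an exception instead of silently handing an error page to `BeautifulSoup`.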

In [5]:
soup

<!DOCTYPE html>

<html lang="en">
<meta charset="utf-8"/>
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; display: flex; flex-direction: row; flex-wrap: wrap; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { flex: 1; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
summary { font-weight: bold; cursor: pointer; }
details[open] { background: #970302; color: #dfdedd; }
.text-muted { color: #777; }
@media (prefers-color-scheme: dark) {
  a { color: #9e9eff; }
  body { background: transparent; color: #ddd; }
  .footer { border-top: 

Finding the heading

In [None]:
heading=soup.find('h1').text
print(heading)

Mahatma Gandhi


In [None]:
page_title = soup.title.text
print(page_title)

Mahatma Gandhi - Wikipedia


In [None]:
soup.text



In [None]:
soup.text.strip()



In [None]:
print(soup.text.replace('\n\n',''))

Mahatma Gandhi - WikipediaJump to contentMain menuMain menu
move to sidebar
hide		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload fileSpecial pagesSearchSearch
Appearance
DonateCreate accountLog in
Personal toolsDonate Create account Log in
Contents
move to sidebar
hide
(Top)1
Early life and background
Toggle Early life and background subsection1.1
Parents
1.2
Childhood
1.3
Marriage
2
Three years in London
Toggle Three years in London subsection2.1
Student of law
2.2
Vegetarianism and committee work
2.3
Called to the bar
3
Civil rights activist in South Africa (1893–1914)
Toggle Civil rights activist in South Africa (1893–1914) subsection3.1
Europeans, Indians and Africans
4
Struggle for Indian independence (1915–1947)
Toggle Struggle for Indian independence (1915–1947) subsection4.1
Role in World War I
4.2
Champaran agitations
4.3
Kheda agitations
4.4
Khilafat Movement
4.5
Non-co-op

Accessing the whole text




In [None]:
page_text = soup.get_text(separator=' ', strip=True)
print(page_text)

additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy . Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. , a non-profit organization. Privacy policy About Wikipedia Disclaimers Contact Wikipedia Code of Conduct Developers Statistics Cookie statement Mobile view Search Search Toggle the table of contents Mahatma Gandhi 204 languages Add topic
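
The run-together words in the earlier `soup.text` output (for example "WikipediaJump to contentMain menu") appear because `.text` joins neighbouring text nodes with nothing in between. The small sketch below uses made-up HTML to show the difference between `.text` and `get_text(separator=' ', strip=True)`.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, only for illustration
html = '<div><p>Main menu</p><p>Jump to content</p></div>'
sample = BeautifulSoup(html, 'html.parser')

joined = sample.text                                  # text nodes glued together
spaced = sample.get_text(separator=' ', strip=True)   # one space between text nodes

print(repr(joined))
print(repr(spaced))
```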


Accessing the first few paragraphs

In [None]:
soup.find_all('p')

[<p class="mw-empty-elt">
 </p>,
 <p><b>Mohandas Karamchand Gandhi</b><sup class="reference" id="cite_ref-4"><a href="#cite_note-4"><span class="cite-bracket">[</span>c<span class="cite-bracket">]</span></a></sup> (2<span class="nowrap"> </span>October 1869 – 30<span class="nowrap"> </span>January 1948)<sup class="reference" id="cite_ref-5"><a href="#cite_note-5"><span class="cite-bracket">[</span>2<span class="cite-bracket">]</span></a></sup> was an Indian lawyer, <a href="/wiki/Nationalism#anti-colonial" title="Nationalism">anti-colonial nationalist</a>, and <a href="/wiki/Political_ethics" title="Political ethics">political ethicist</a> who employed <a href="/wiki/Nonviolent_resistance" title="Nonviolent resistance">nonviolent resistance</a> to lead the successful <a href="/wiki/Indian_independence_movement" title="Indian independence movement">campaign for India's independence</a> from <a href="/wiki/British_Raj" title="British Raj">British rule</a>. He inspired movements for <a hr

In [None]:
soup.find_all('p')[0:3]

[<p class="mw-empty-elt">
 </p>,
 <p><b>Mohandas Karamchand Gandhi</b><sup class="reference" id="cite_ref-4"><a href="#cite_note-4"><span class="cite-bracket">[</span>c<span class="cite-bracket">]</span></a></sup> (2<span class="nowrap"> </span>October 1869 – 30<span class="nowrap"> </span>January 1948)<sup class="reference" id="cite_ref-5"><a href="#cite_note-5"><span class="cite-bracket">[</span>2<span class="cite-bracket">]</span></a></sup> was an Indian lawyer, <a href="/wiki/Nationalism#anti-colonial" title="Nationalism">anti-colonial nationalist</a>, and <a href="/wiki/Political_ethics" title="Political ethics">political ethicist</a> who employed <a href="/wiki/Nonviolent_resistance" title="Nonviolent resistance">nonviolent resistance</a> to lead the successful <a href="/wiki/Indian_independence_movement" title="Indian independence movement">campaign for India's independence</a> from <a href="/wiki/British_Raj" title="British Raj">British rule</a>. He inspired movements for <a hr

In [None]:
for p in soup.find_all('p'):
  print(p.text)
  print('-'*10)



----------
Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.[3]

----------
Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, h

In [None]:
for p in soup.find_all('p')[0:2]:
  print(p.text)
  print('-'*10)



----------
Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.[3]

----------


This code loops through all paragraph (`<p>`) tags found in the webpage using `soup.find_all('p')`. For each paragraph, it first checks whether the text inside it is not empty by using `p.text.strip()`, which removes surrounding whitespace and prevents blank paragraphs from being printed. If the paragraph contains text, the cleaned paragraph is printed, followed by a row of dashes (`'-'*10`) to visually separate each output. A counter variable named `count` is increased every time a valid paragraph is printed, and once the counter reaches 5, the loop stops using `break`, which ensures only the first five non-empty paragraphs from the page are displayed.

In [None]:
count = 0
for p in soup.find_all('p'):
    if p.text.strip():  # Check if the paragraph has non-empty text
        print(p.text.strip())
        print('-'*10)
        count += 1
        if count == 5:
            break

Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.[3]
----------
Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, he returned to 
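
The counter-and-`break` pattern above can also be written with `itertools.islice`, which stops a generator after a fixed number of items. This is only an alternative sketch; the function name `first_paragraphs` and the sample HTML are made up for illustration.

```python
from itertools import islice
from bs4 import BeautifulSoup

def first_paragraphs(soup, n=5):
    """Return the text of the first n non-empty <p> elements."""
    stripped = (p.get_text().strip() for p in soup.find_all('p'))
    non_empty = (text for text in stripped if text)
    return list(islice(non_empty, n))

# Made-up sample: the first paragraph is empty, like the placeholder on the real page
sample = BeautifulSoup('<p> </p><p>One</p><p>Two</p><p>Three</p>', 'html.parser')
print(first_paragraphs(sample, 2))
```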

This code collects text from all HTML paragraph (`<p>`) tags on a webpage and stores it in a single string called `corpus`. It loops through each paragraph found with `soup.find_all('p')`, which returns every `<p>` element in the parsed HTML. For each paragraph, it checks if the text is not empty using `p.text.strip()`, because `.strip()` removes spaces, tabs, and newline characters from the beginning and end of the paragraph text. If the cleaned paragraph contains actual content, that text is added to `corpus`, followed by a newline (`'\n'`) so each paragraph appears on a separate line. This keeps empty paragraphs out of the result. Finally, `print(corpus)` displays the combined text extracted from all non-empty paragraphs on the page.

In [None]:
corpus=''
for p in soup.find_all('p'):
    if p.text.strip():  # Check if the paragraph has non-empty text
        corpus=corpus+p.text.strip()
        corpus=corpus+'\n'

print(corpus)

Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.[3]
Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, he returned to India and s

Remove unwanted elements from the paragraphs

In [None]:
for i in range(3,467):
  print('['+str(i)+']')

[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
[74]
[75]
[76]
[77]
[78]
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
[91]
[92]
[93]
[94]
[95]
[96]
[97]
[98]
[99]
[100]
[101]
[102]
[103]
[104]
[105]
[106]
[107]
[108]
[109]
[110]
[111]
[112]
[113]
[114]
[115]
[116]
[117]
[118]
[119]
[120]
[121]
[122]
[123]
[124]
[125]
[126]
[127]
[128]
[129]
[130]
[131]
[132]
[133]
[134]
[135]
[136]
[137]
[138]
[139]
[140]
[141]
[142]
[143]
[144]
[145]
[146]
[147]
[148]
[149]
[150]
[151]
[152]
[153]
[154]
[155]
[156]
[157]
[158]
[159]
[160]
[161]
[162]
[163]
[164]
[165]
[166]
[167]
[168]
[169]
[170]
[171]
[172]
[173]
[174]
[175]
[176]
[177]
[178]
[179]
[180]
[181]
[182]
[183]
[184]
[185]
[186]


These are the strings I want to replace.

This code removes citation numbers such as `[3]`, `[24]`, or `[466]` from the scraped text; these markers come from Wikipedia references. The loop runs from 3 to 466 using `range(3, 467)`, and in each iteration `corpus.replace()` replaces every occurrence of `"[number]"` in the string `corpus` with an empty string. If a marker does not occur, the string is left unchanged. For example, if the paragraph contains `"Gandhi was born in Porbandar.[12]"`, the code removes `"[12]"`. Because the range starts at 3 and covers only numbers, markers such as `[2]` and lettered notes such as `[c]` stay in the text, as the output below shows. The bounds are hard-coded for this page, so they need adjusting for other articles.


In [None]:
for i in range(3,467):
  corpus=corpus.replace('['+str(i)+']','')

In [None]:
print(corpus)

Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.
Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, he returned to India and soon
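
The output above still contains `[c]` and `[2]`, because the loop only covers the numbers 3 to 466. A regular expression can remove every bracketed marker in one pass. The sketch below is one possible pattern, assuming note labels are one or two lowercase letters; `strip_citations` is a hypothetical helper name.

```python
import re

# Matches bracketed numbers like [12] and short lowercase note labels like [c].
# Assumption: note labels are one or two lowercase letters.
CITATION = re.compile(r'\[(?:\d+|[a-z]{1,2})\]')

def strip_citations(text):
    """Remove Wikipedia-style reference markers from text."""
    return CITATION.sub('', text)

sample = 'Mohandas Karamchand Gandhi[c] was an Indian lawyer.[2] He is known worldwide.[3]'
print(strip_citations(sample))
```

Unlike the fixed range, this does not depend on how many references a particular article has.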

Save the cleaned text to a file

In [None]:
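# Write the cleaned text to "<heading>.txt"; UTF-8 keeps characters such as ā intact
with open(heading + '.txt', 'w', encoding='utf-8') as fd:
    fd.write(corpus)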

✅ 1. Extract the page title

In [None]:
title = soup.title.text
print(title)


Mahatma Gandhi - Wikipedia


✅ 2. Extract the main heading

In [None]:
main_heading = soup.find('h1').text  # text of the page's <h1> element
print(main_heading)

Mahatma Gandhi


✅ 3. Extract the first paragraph (introduction text)

In [None]:
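# The first <p> on this page is an empty placeholder, so take the first one that has text
intro = next(p.text.strip() for p in soup.find_all('p') if p.text.strip())
print(intro)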





In [None]:
import re
clean_intro = re.sub(r'\[\d+\]', '', intro)


✅ 4. Extract all paragraphs

In [None]:
paragraphs = [p.text for p in soup.find_all('p')]


In [None]:
paragraphs

['\n',
 "Mohandas Karamchand Gandhi[c] (2\xa0October 1869\xa0– 30\xa0January 1948)[2] was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.[3]\n",
 'Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, he

✅ 5. Extract Info-box data (summary on the right)

This code tries to locate the infobox section of a Wikipedia page, which typically contains key summary information in a table format on the right side of the article. It does this by searching for an HTML `<table>` element with the class name `'infobox'` using `soup.find()`, and stores the result in the variable `infobox`. Once the infobox table is found, the code retrieves all its table rows by calling `infobox.find_all('tr')`, which returns every `<tr>` element representing a single row of information inside the infobox. These rows usually include the person’s image, name, dates, birthplace, occupations, and other key facts that can later be extracted or processed individually.


In [None]:
infobox = soup.find('table', {'class':'infobox'})
rows = infobox.find_all('tr')


This code extracts structured information from the rows of the Wikipedia infobox table and stores it in a Python dictionary. It loops through every `<tr>` row in the `rows` list, and for each row it looks for a header cell (`<th>`) and a value cell (`<td>`). If both exist, the text content of the header is cleaned using `.text.strip()` to remove extra spaces or newline characters, and this becomes the dictionary key. The corresponding value text, also cleaned with `.strip()`, becomes the value associated with that key in the dictionary. As a result, important facts such as “Born”, “Died”, “Nationality”, or “Occupation” get stored in the `data` dictionary in a clear key–value format that can be printed, processed further, or converted into other data structures.


In [None]:
data = {}
for row in rows:
    header = row.find('th')
    value = row.find('td')
    if header and value:
        data[header.text.strip()] = value.text.strip()


In [None]:
data

{'Born': 'Mohandas Karamchand Gandhi(1869-10-02)2 October 1869Porbandar, Kathiawar Agency, British India',
 'Died': '30 January 1948(1948-01-30) (aged\xa078)New Delhi, India',
 'Cause\xa0of death': 'Assassination by gunshot',
 'Monuments': 'Raj Ghat, Delhi\nGandhi Smriti, New Delhi',
 'Other\xa0names': 'Bāpū (father), Rāṣṭrapitā (the Father of the Nation)',
 'Alma\xa0mater': 'Samaldas Arts College[a]University College London[b]Inns of Court School of Law',
 'Occupations': 'Lawyeractivistpolitician',
 'Years\xa0active': '1893–1948',
 'Known\xa0for': "Leadership of the campaign for India's independence from British ruleNonviolent resistance",
 'Political party': 'Indian National Congress (1920–1934)',
 'Spouse': 'Kasturba Gandhi\n\u200b \u200b(m.\xa01883; died\xa01944)\u200b',
 'Children': 'HarilalManilalRamdasDevdas',
 'Parents': 'Karamchand GandhiPutlibai Gandhi',
 'Relatives': 'Gandhi family',
 'Preceded by': 'Maulana Azad',
 'Succeeded by': 'Sarojini Naidu'}
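
In the output above, list values run together (`'Lawyeractivistpolitician'`) and keys contain non-breaking spaces (`\xa0`). The sketch below shows one way to tidy both, and it also returns an empty dictionary when a page has no infobox instead of failing on `None`. The function name `parse_infobox` and the sample table are made up for illustration.

```python
from bs4 import BeautifulSoup

def parse_infobox(soup):
    """Return infobox rows as a dict, or an empty dict if there is no infobox."""
    infobox = soup.find('table', class_='infobox')
    if infobox is None:
        return {}
    data = {}
    for row in infobox.find_all('tr'):
        header, value = row.find('th'), row.find('td')
        if header and value:
            # The separator keeps neighbouring elements apart; split()/join
            # collapses whitespace runs, including non-breaking spaces (\xa0).
            key = ' '.join(header.get_text(' ', strip=True).split())
            data[key] = ' '.join(value.get_text(' ', strip=True).split())
    return data

# Made-up sample markup that mimics two infobox rows
html = '''
<table class="infobox">
  <tr><th>Years&nbsp;active</th><td>1893-1948</td></tr>
  <tr><th>Occupations</th><td><ul><li>Lawyer</li><li>activist</li><li>politician</li></ul></td></tr>
</table>
'''
print(parse_infobox(BeautifulSoup(html, 'html.parser')))
```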

✅ 6. Extract section headings

In [None]:
sections=[h.text for h in soup.find_all(['h2','h3'])]

In [None]:
sections

['Contents',
 'Early life and background',
 'Parents',
 'Childhood',
 'Marriage',
 'Three years in London',
 'Student of law',
 'Vegetarianism and committee work',
 'Called to the bar',
 'Civil rights activist in South Africa (1893–1914)',
 'Europeans, Indians and Africans',
 'Struggle for Indian independence (1915–1947)',
 'Role in World War I',
 'Champaran agitations',
 'Kheda agitations',
 'Khilafat Movement',
 'Non-co-operation',
 'Salt Satyagraha (Salt March/Civil Disobedience Movement)',
 'Gandhi as folk hero',
 'Negotiations',
 'Round Table Conferences',
 'Congress politics',
 'World War II and Quit India movement',
 'Partition and independence',
 'Death',
 'Funeral and memorials',
 'Principles, practices, and beliefs',
 'Truth and Satyagraha',
 'Nonviolence',
 'Brahmacharya: abstinence from sex and food',
 'Literary works',
 'Legacy',
 'Followers and international influence',
 'Global days that celebrate Gandhi',
 'Awards',
 'Film, theatre, and literature',
 '21st-century impac

✅ 7. Extract hyperlinks

In [None]:
links = []
for a in soup.find_all('a', href=True):
    links.append(a['href'])


In [None]:
print(links)

['#bodyContent', '/wiki/Main_Page', '/wiki/Wikipedia:Contents', '/wiki/Portal:Current_events', '/wiki/Special:Random', '/wiki/Wikipedia:About', '//en.wikipedia.org/wiki/Wikipedia:Contact_us', '/wiki/Help:Contents', '/wiki/Help:Introduction', '/wiki/Wikipedia:Community_portal', '/wiki/Special:RecentChanges', '/wiki/Wikipedia:File_upload_wizard', '/wiki/Special:SpecialPages', '/wiki/Main_Page', '/wiki/Special:Search', 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=Mahatma+Gandhi', '/w/index.php?title=Special:UserLogin&returnto=Mahatma+Gandhi', 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=Mahatma+Gandhi', '/w/index.php?title=Special:UserLogin&returnto=Mahatma+Gandhi', '#', '#Early_life_and_background', '#Parents', '#Childhood', '#Marriage', '#Three_years_in_Lon

In [None]:
external_links = [l for l in links if l.startswith('http')]


In [None]:
external_links

['https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en',
 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en',
 'https://kbd.wikipedia.org/wiki/%D0%9C%D0%B0%D1%85%D0%B0%D1%82%D0%BC%D0%B0_%D0%93%D0%B0%D0%BD%D0%B4%D0%B8',
 'https://ady.wikipedia.org/wiki/%D0%9C%D0%B0%D1%85%D0%B0%D1%82%D0%BC%D0%B0_%D0%93%D0%B0%D0%BD%D0%B4%D0%B8',
 'https://af.wikipedia.org/wiki/Mahatma_Gandhi',
 'https://als.wikipedia.org/wiki/Mohandas_Karamchand_Gandhi',
 'https://am.wikipedia.org/wiki/%E1%88%9B%E1%88%85%E1%89%B0%E1%88%9B_%E1%8C%8B%E1%8A%95%E1%8B%B2',
 'https://ang.wikipedia.org/wiki/Mohandas_Karamchand_Gandhi',
 'https://ar.wikipedia.org/wiki/%D9%85%D9%87%D8%A7%D8%AA%D9%85%D8%A7_%D8%BA%D8%A7%D9%86%D8%AF%D9%8A',
 'https://an.wikipedia.org/wiki/Mohandas_Karamchand_Gandhi',
 'https://hyw.wikipedia.org/wiki/%D5%84%D5%A1%D5%B0%D5%A1%D5%A9%D5%B4%D5%A1_%D4%BF%D5%A1%D5%B6%D5%BF%D5%AB',
 'https://frp.
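
Filtering on `startswith('http')` skips protocol-relative links such as `//en.wikipedia.org/...` and treats every absolute URL as external. A sketch of an alternative, using only the standard library: resolve each `href` against the page URL with `urljoin`, then compare host names. The helper names `absolutize` and `is_external` are hypothetical.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://en.wikipedia.org/wiki/Mahatma_Gandhi'

def absolutize(hrefs, base_url=BASE_URL):
    """Resolve relative and protocol-relative hrefs against the page URL."""
    return [urljoin(base_url, href) for href in hrefs]

def is_external(url, base_url=BASE_URL):
    """True if the URL points to a different host than the page itself."""
    return urlparse(url).netloc != urlparse(base_url).netloc

# A few hrefs taken from the links output above
sample = ['#Parents', '/wiki/Main_Page',
          '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
          'https://af.wikipedia.org/wiki/Mahatma_Gandhi']
resolved = absolutize(sample)
external = [u for u in resolved if is_external(u)]
print(resolved)
print(external)
```

The same `urljoin` call also works for image `src` values, whether they start with `/` or `//`.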

✅ 8. Extract images

In [None]:
imgs = [img['src'] for img in soup.find_all('img', src=True)]  # skip <img> tags without a src attribute


In [None]:
imgs

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/20px-Symbol_support_vote.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Mahatma-Gandhi%2C_studio%2C_1931.jpg/250px-Mahatma-Gandhi%2C_studio%2C_1931.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Mohandas_K._Gandhi_signature.svg/250px-Mohandas_K._Gandhi_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/1/11/Mohandas_K_Gandhi%2C_age_7.jpg/250px-Mohandas_K_Gandhi%2C_age_7.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Gandhi_and_Laxmidas_2.jpg/250px-Gandhi_and_Laxmidas_2.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/e/e6/MAHATMA_GANDHI_1869-1948_lived_here_as_a_law_s

In [None]:
full_urls = ['https:'+i for i in imgs if i.startswith('//')]


In [None]:
full_urls

['https://upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/20px-Symbol_support_vote.svg.png',
 'https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Mahatma-Gandhi%2C_studio%2C_1931.jpg/250px-Mahatma-Gandhi%2C_studio%2C_1931.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Mohandas_K._Gandhi_signature.svg/250px-Mohandas_K._Gandhi_signature.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Mohandas_K_Gandhi%2C_age_7.jpg/250px-Mohandas_K_Gandhi%2C_age_7.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Gandhi_and_Laxmidas_2.jpg/250px-Gandhi_and_Laxmidas_2.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/MAHATMA_GANDHI_1869-1948_lived_here_as_a_law_student.jpg/250px-MAHATMA_GANDHI_1869-1948_lived_here_as_a_law_student.jpg',
 'https://upload.wikimedia.org/wikipedia/com

✅ 15. Extract lists (Biography, Early Life, etc.)

In [None]:
lists = soup.find_all('ul')
for ul in lists[:5]:
    print([li.text for li in ul.find_all('li')])


['Main page', 'Contents', 'Current events', 'Random article', 'About Wikipedia', 'Contact us']
['Help', 'Learn to edit', 'Community portal', 'Recent changes', 'Upload file', 'Special pages']
[]
[]
[]


✅ (**IMPORTANT**) Extract internal Wikipedia links with titles

In [None]:
internal_links = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.startswith('/wiki/') and ':' not in href:
        internal_links.append('https://en.wikipedia.org' + href)


In [None]:
internal_links

['https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Mahatma_Gandhi',
 'https://en.wikipedia.org/wiki/Mahatma_Gandhi',
 'https://en.wikipedia.org/wiki/Mahatma_Gandhi',
 'https://en.wikipedia.org/wiki/Gandhi_(disambiguation)',
 'https://en.wikipedia.org/wiki/Mah%C4%81tm%C4%81',
 'https://en.wikipedia.org/wiki/Porbandar',
 'https://en.wikipedia.org/wiki/Assassination_of_Mahatma_Gandhi',
 'https://en.wikipedia.org/wiki/Raj_Ghat_and_associated_memorials',
 'https://en.wikipedia.org/wiki/Gandhi_Smriti',
 'https://en.wikipedia.org/wiki/Father_of_the_Nation',
 'https://en.wikipedia.org/wiki/Samaldas_Arts_College',
 'https://en.wikipedia.org/wiki/University_College_London',
 'https://en.wikipedia.org/wiki/City_Law_School',
 'https://en.wikipedia.org/wiki/Indian_independence_movement',
 'https://en.wikipedia.org/wiki/British_Raj',
 'https://en.wikipedia.org/wiki/Nonviolent_resistance',
 'https://en.wikipedia.org/wiki/Indian_Nati
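
The list above holds only URLs and repeats some of them. The sketch below keeps a title next to each URL and drops duplicates while preserving page order. It assumes the link's `title` attribute (or, failing that, its visible text) is an acceptable title; `internal_links_with_titles` is a hypothetical name and the sample HTML is made up.

```python
from bs4 import BeautifulSoup

def internal_links_with_titles(soup, base='https://en.wikipedia.org'):
    """Return unique (title, url) pairs for article links, in page order."""
    seen = set()
    pairs = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('/wiki/') and ':' not in href and href not in seen:
            seen.add(href)
            # Fall back to the link text when there is no title attribute.
            title = a.get('title') or a.get_text(strip=True)
            pairs.append((title, base + href))
    return pairs

# Made-up sample: a repeated link, a namespaced link, and a link without a title
html = '''
<a href="/wiki/Porbandar" title="Porbandar">Porbandar</a>
<a href="/wiki/Porbandar" title="Porbandar">his birthplace</a>
<a href="/wiki/Help:Contents">Help</a>
<a href="/wiki/British_Raj">British rule</a>
'''
print(internal_links_with_titles(BeautifulSoup(html, 'html.parser')))
```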

✅ 18. Extract the Table of Contents (TOC)

In [None]:
toc_list = soup.find('ul', id='mw-panel-toc-list')
if toc_list:
    for li in toc_list.find_all('li'):
        print(li.text.strip())
else:
    print("Table of Contents not found with the specified ID.")

(Top)
1
Early life and background




Toggle Early life and background subsection





1.1
Parents








1.2
Childhood








1.3
Marriage
1.1
Parents
1.2
Childhood
1.3
Marriage
2
Three years in London




Toggle Three years in London subsection





2.1
Student of law








2.2
Vegetarianism and committee work








2.3
Called to the bar
2.1
Student of law
2.2
Vegetarianism and committee work
2.3
Called to the bar
3
Civil rights activist in South Africa (1893–1914)




Toggle Civil rights activist in South Africa (1893–1914) subsection





3.1
Europeans, Indians and Africans
3.1
Europeans, Indians and Africans
4
Struggle for Indian independence (1915–1947)




Toggle Struggle for Indian independence (1915–1947) subsection





4.1
Role in World War I








4.2
Champaran agitations








4.3
Kheda agitations








4.4
Khilafat Movement








4.5
Non-co-operation








4.6
Salt Satyagraha (Salt March/Civil Disobedience Movement)








4.7
Gandhi as folk hero
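
The output above repeats subsections and contains many blank lines, because `li.text` on a parent item also includes the text of every nested `<li>`. One way around this is to read only the first link inside each item. The sketch below runs on made-up markup that merely mimics a nested list; the real page structure may differ, and `toc_entries` is a hypothetical name.

```python
from bs4 import BeautifulSoup

def toc_entries(toc_list):
    """Return one cleaned string per TOC entry, without nested duplicates."""
    entries = []
    for li in toc_list.find_all('li'):
        link = li.find('a')  # first link in the item: the entry's own link
        if link:
            entries.append(link.get_text(' ', strip=True))
    return entries

# Made-up markup that only mimics a nested table of contents
html = '''
<ul id="mw-panel-toc-list">
  <li><a href="#Early_life"><span>1</span> <span>Early life and background</span></a>
    <ul>
      <li><a href="#Parents"><span>1.1</span> <span>Parents</span></a></li>
    </ul>
  </li>
  <li><a href="#Death"><span>2</span> <span>Death</span></a></li>
</ul>
'''
toc = BeautifulSoup(html, 'html.parser').find('ul', id='mw-panel-toc-list')
print(toc_entries(toc))
```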










Remove stopwords and tokenize
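
The notebook has no code for this step, so the following is only a minimal sketch using the standard library. The stopword set is a tiny illustrative list, not a complete one; in practice a list from a library such as NLTK or spaCy is the usual choice. The names `tokenize` and `remove_stopwords` are hypothetical.

```python
import re

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {'a', 'an', 'and', 'the', 'in', 'of', 'to', 'was', 'is', 'for', 'he', 'at'}

def tokenize(text):
    """Lowercase the text and return runs of letters as tokens."""
    # [^\W\d_]+ matches Unicode letters only (no digits, no underscore)
    return re.findall(r'[^\W\d_]+', text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]

sample = 'Gandhi was trained in the law at the Inner Temple in London.'
print(remove_stopwords(tokenize(sample)))
```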

Extract named entities (NER)

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(corpus)  # corpus: the cleaned article text built earlier

entities = [(ent.text, ent.label_) for ent in doc.ents]


In [None]:
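import re

death_section_content = []
heading_tags = ['h2', 'h3', 'h4', 'h5', 'h6']

def is_heading_wrapper(tag):
    # Newer Wikipedia markup wraps each heading in <div class="mw-heading ...">
    return 'mw-heading' in (tag.get('class') or [])

# 1. Find the 'Death' (or similar) heading by id.
# Older markup puts the id on a <span class="mw-headline"> inside the heading;
# newer markup puts it on the heading tag itself, so match any tag.
death_heading = soup.find(id='Death')

# Many Wikipedia pages use different ids like 'Assassination' or 'Death_and_legacy'
# So you can also try a fallback:
if not death_heading:
    death_heading = soup.find(id='Assassination')

if death_heading:
    # 2. Get the heading tag (h2, h3, etc.): the match itself or its parent
    if death_heading.name in heading_tags:
        current_element = death_heading
    else:
        current_element = death_heading.find_parent(heading_tags)

    if current_element:
        level = int(current_element.name[1])
        # Start from the wrapper div when there is one, since the section body follows it
        start = current_element.parent if is_heading_wrapper(current_element.parent) else current_element

        # 3. Iterate through siblings until another heading of same or higher level
        for sibling in start.next_siblings:
            # Skip non-tag elements like '\n' (strings have name None)
            if getattr(sibling, 'name', None) is None:
                continue

            # If we hit another heading of same or higher level, stop
            next_heading = sibling
            if is_heading_wrapper(sibling):
                next_heading = sibling.find(heading_tags) or sibling
            if re.fullmatch(r'h[1-6]', next_heading.name) and int(next_heading.name[1]) <= level:
                break

            # Collect paragraph text
            if sibling.name == 'p' and sibling.get_text(strip=True):
                death_section_content.append(sibling.get_text(strip=True))

            # Collect list items (if any)
            elif sibling.name == 'ul':
                for li in sibling.find_all('li'):
                    text = li.get_text(strip=True)
                    if text:
                        death_section_content.append(text)

# 4. Join and print
full_death_content = '\n'.join(death_section_content)
print(full_death_content if full_death_content else "No death section found.")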


In [None]:
full_death_content

In [None]:
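import re
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Mahatma_Gandhi"
# Wikipedia rejects requests without a User-Agent, so send the browser-like header used elsewhere in this notebook
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

death_section_content = []
heading_name = re.compile(r"^h[2-6]$")

# 1️⃣ Try to find the 'Death' section heading in a robust way
# First by id (on any tag, since the id may sit on a span or on the heading itself)
death_heading = soup.find(id="Death")

# Fallback: first heading whose visible text mentions "Death"
if not death_heading:
    death_heading = next(
        (h for h in soup.find_all(heading_name) if "Death" in h.get_text()),
        None,
    )

if death_heading:
    # 2️⃣ Get the heading tag (h2/h3/...): the match itself or its parent
    if heading_name.match(death_heading.name):
        current_element = death_heading
    else:
        current_element = death_heading.find_parent(heading_name)

    if current_element:
        # Newer markup wraps headings in <div class="mw-heading">; start from the wrapper if present
        wrapper = current_element.parent
        start = wrapper if "mw-heading" in (wrapper.get("class") or []) else current_element

        # 3️⃣ Iterate through following siblings at same level
        for sibling in start.find_next_siblings():
            # Stop when next section heading starts (bare heading or wrapped heading)
            if heading_name.match(sibling.name) or "mw-heading" in (sibling.get("class") or []):
                break

            # Paragraphs inside the Death section
            if sibling.name == "p":
                text = sibling.get_text(strip=True)
                if text:
                    death_section_content.append(text)

            # Lists inside the Death section (bullets)
            elif sibling.name in ("ul", "ol"):
                for li in sibling.find_all("li"):
                    text = li.get_text(strip=True)
                    if text:
                        death_section_content.append(text)

# 4️⃣ Join and print
full_death_content = "\n\n".join(death_section_content)

if full_death_content:
    print(full_death_content)
else:
    print("Still could not extract Death section – check the page URL / HTML structure.")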
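
The idea behind the cell above (collect paragraphs from one heading until the next heading) can be tried offline on a small HTML snippet, without a network request. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs anywhere. The class name `SectionParagraphs` and the sample HTML are made up for illustration. It also simplifies in two ways: it assumes the `id` sits on the heading tag itself, and it stops at any heading rather than comparing heading levels.

```python
from html.parser import HTMLParser


class SectionParagraphs(HTMLParser):
    """Collect <p> text between the heading with a given id and the next heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, heading_id):
        super().__init__()
        self.heading_id = heading_id
        self.in_section = False    # True while inside the target section
        self.in_paragraph = False  # True while inside a <p> of that section
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # The target heading opens the section; any other heading closes it.
            self.in_section = dict(attrs).get("id") == self.heading_id
        elif tag == "p" and self.in_section:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up sample page with three sections.
sample_html = """
<h2 id="Life">Life</h2><p>Early years.</p>
<h2 id="Death">Death</h2><p>First <b>paragraph</b>.</p><p>Second paragraph.</p>
<h2 id="Legacy">Legacy</h2><p>Later text.</p>
"""

section = SectionParagraphs("Death")
section.feed(sample_html)
print(section.paragraphs)
```

Testing the traversal on a fixed snippet like this makes it easier to tell whether an empty result comes from the extraction logic or from a change in the live page's markup.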


Google Search link Generator

In [None]:
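from urllib.parse import quote_plus

title = input('Enter the title of the page: ')
# quote_plus turns spaces into '+' and escapes other reserved characters
link = 'https://www.google.com/search?q=' + quote_plus(title + ' wikipedia')
print(link)


Enter the title of the page: tajmahal
https://www.google.com/search?q=tajmahal+wikipedia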


In [None]:
res=requests.get('https://www.google.com/search?q=Mahatma+Gandhi+wikipedia')
soup=BeautifulSoup(res.text,'html.parser')

In [None]:
heading=soup.find('h1').text
print(heading)

Mahatma Gandhi


In [None]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
res = requests.get('https://en.wikipedia.org/wiki/Mahatma_Gandhi', headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

heading = soup.find('h1').text
print(heading)

Mahatma Gandhi


In [None]:
import re

# Join the text of every paragraph, one paragraph per line
corpus = "\n".join(p.get_text() for p in soup.find_all('p')).strip()

# remove any numeric reference like [1], [23], [450]
corpus = re.sub(r'\[\d+\]', '', corpus)


In [None]:
print(corpus)

Mohandas Karamchand Gandhi[c] (2 October 1869 – 30 January 1948) was an Indian lawyer, anti-colonial nationalist, and political ethicist who employed nonviolent resistance to lead the successful campaign for India's independence from British rule. He inspired movements for civil rights and freedom across the world. The honorific Mahātmā (from Sanskrit, meaning great-souled, or venerable), first applied to him in South Africa in 1914, is used worldwide.

Born and raised in a Hindu family in coastal Gujarat, Gandhi was trained in the law at the Inner Temple in London and was called to the bar at the age of 22. After two uncertain years in India, where he was unable to start a successful law practice, Gandhi moved to South Africa in 1893 to represent an Indian merchant in a lawsuit. He went on to live in South Africa for the next 21 years. Here, Gandhi raised a family and first employed nonviolent resistance in a campaign for civil rights. In 1915, aged 45, he returned to India and soon s
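
The printed corpus still contains the lettered note marker `[c]`, because the pattern `\[\d+\]` only matches digits. A possible extension, sketched below, also strips short lowercase letter markers. The function name `strip_footnote_markers` and the sample sentence are made up for illustration, and the one-or-two-letter limit is an assumption about how such notes are labelled.

```python
import re

# Matches numeric references like [23] and lettered notes like [c] or [ab].
# Limiting letters to one or two characters is an assumption, chosen to
# avoid removing ordinary bracketed words.
FOOTNOTE_MARKER = re.compile(r"\[(?:\d+|[a-z]{1,2})\]")


def strip_footnote_markers(text):
    """Remove bracketed footnote markers from the text."""
    return FOOTNOTE_MARKER.sub("", text)


sample = "Gandhi[c] was born[12] in 1869."
print(strip_footnote_markers(sample))  # Gandhi was born in 1869.
```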

Generate a link for any topic available on Wikipedia



In [None]:
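from urllib.parse import quote_plus

title = input('Enter the topic: ')
# Encode the whole query so the space before 'wikipedia' becomes '+' as well
link = 'https://www.google.com/search?q=' + quote_plus(title + ' wikipedia')

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')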


Enter the topic: gandhi


In [None]:
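wiki_link = None
for div in soup.find_all('div'):
    anchor = div.find('a')
    href = anchor.get('href') if anchor else None
    if href is None:
        continue
    link = href
    if 'en.wikipedia.org' in link:
        wiki_link = link
        break

if wiki_link:
    # Drop the 7-character redirect prefix and any trailing tracking parameters
    print(wiki_link[7:].split('&')[0])
else:
    print("Wikipedia link not found in search results. Google's HTML structure might have changed or the link is not present.")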

Wikipedia link not found in search results. Google's HTML structure might have changed or the link is not present.


In [None]:
link


'/search?q=gandhi+wikipedia&sca_esv=8dfe9edb86176f37&ie=UTF-8&emsg=SG_REL&sei=D34waZrJJ5zap84P9qy74As'
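
The slice `link[7:]` in the loop above suggests the expected href has a 7-character prefix such as `/url?q=`, with the real destination in the `q` parameter. Under that assumption, parsing the query string is a less brittle way to recover the target than slicing by position. The sketch below uses the hypothetical helper name `extract_target` and a made-up redirect href; the second href is the value of `link` shown above.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination from a '/url?q=<target>&...' href, else None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    # parse_qs also percent-decodes the value.
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Made-up redirect-style href (an assumption about the result format).
redirect_href = "/url?q=https://en.wikipedia.org/wiki/Mahatma_Gandhi&sa=U&ved=abc"
print(extract_target(redirect_href))

# The href shown in the output above is a search link, not a redirect.
search_href = "/search?q=gandhi+wikipedia&sca_esv=8dfe9edb86176f37&ie=UTF-8"
print(extract_target(search_href))
```

Returning `None` for non-redirect links makes the "not found" case explicit instead of printing a sliced fragment of an unrelated URL.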