
In [None]:
# Importing Libraries
import requests
from bs4 import BeautifulSoup

**User-defined Functions**

In [None]:
# Fetch a URL and return its parsed HTML as a BeautifulSoup object
def make_soup(url):
  # A timeout keeps the request from hanging forever if the server does not respond
  page = requests.get(url, timeout=30)
  soup = BeautifulSoup(page.content, "html.parser")
  return soup

# Return the stripped text of every matching tag (a tag name or a list of tag names),
# skipping tags whose text is empty
def get_texts_from_tags(tags, soup):
  tag_texts = []
  for tag in soup.find_all(tags):
    tag_texts.append(tag.getText().strip())

  # Return after removing all empty strings
  return list(filter(None, tag_texts))
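
The text helper can also be tried offline on an inline HTML string, which makes it easy to see how empty tags are dropped. This is a minimal sketch: `SAMPLE_HTML` is a made-up snippet for illustration, not part of the scraped page, and the helper is repeated so that the block runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
SAMPLE_HTML = """
<html><body>
  <h1> Sample title </h1>
  <p>First paragraph.</p>
  <p>   </p>
  <p>Second paragraph.</p>
</body></html>
"""

def get_texts_from_tags(tags, soup):
    # Collect the stripped text of every matching tag, dropping empty strings
    texts = [tag.get_text().strip() for tag in soup.find_all(tags)]
    return [t for t in texts if t]

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = get_texts_from_tags(["p"], sample_soup)
headings = get_texts_from_tags(["h1", "h2"], sample_soup)
print(paragraphs)
print(headings)
```

The whitespace-only `<p>` is filtered out, so only the two real paragraphs remain.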

In [None]:
URL = "https://www.shiksha.com/humanities-social-sciences/articles/top-10-colleges-in-india-mhrd-blogId-15781"
soup = make_soup(URL)

**Finding all text in Paragraph (p) tags**

In [None]:
all_text = get_texts_from_tags(['p'], soup)
all_text

['Delhi University’s Miranda House has been ranked as the best college in the country. Find out which are the top 10 colleges in India as per MHRD-NIRF India Rankings 2020.',
 'The Ministry of Human Resource Development (MHRD), Government of India has released the India Rankings 2020 on June 11 with support from National Institutional Ranking Framework (NIRF). The framework for India rankings was constituted in 2015. The NIRF-MRHD released the India rankings for the first time in 2016.',
 'In the year 2020, over 1,600 institutes participated in the NIRF-MHRD India Rankings. In these rankings, the educational institutes are evaluated across parameters such as – Teaching, Learning and Resources (TLR), Research, Professional Practice (RP), Graduation Outcome (GO), Outreach and Inclusivity (OI) and Perception (PR). Under each of these parameters, educational institutes are evaluated out of a score of 100.',
 'As part of India rankings 2020, NIRF-MHRD has ranked the top educational institut

**Listing out all text in Header Tags**

In [None]:
h_list = get_texts_from_tags(['h1', 'h2', 'h3', 'h4'], soup)
h_list

['Top 10 Colleges in India 2020 by MHRD: Ranking, Placement, Fees',
 'NIRF Rankings 2020: Top 10 Colleges in India',
 'Rank 1 – Miranda House',
 'Rank 2 – Lady Shri Ram College for Women',
 'Rank 3 – Hindu College',
 'Rank 4 – St Stephen’s College',
 'Rank 5 – Presidency College',
 'Rank 6 – Loyola College',
 'Rank 7\xa0–\xa0St Xavier’s College, Kolkata',
 'Rank 7 – Ramakrishna Mission Vidyamandira',
 'Rank 9 – Hans Raj College',
 'Rank 10 – PSGR Krishnammal College for Women',
 'Comments',
 'Shiksha',
 'Colleges',
 'Others',
 'Our Group']
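
The header list mixes the ranking entries with page furniture such as 'Comments' and 'Shiksha', and one entry uses non-breaking spaces (`\xa0`) around the dash. One possible way to keep only the ranking entries is a regular expression. The sketch below assumes the headings keep the 'Rank N – Name' shape seen above; `RANK_PATTERN` and `parse_rank_headings` are illustrative names, not part of the notebook.

```python
import re

# A few heading strings copied from the output above
headings = [
    'NIRF Rankings 2020: Top 10 Colleges in India',
    'Rank 1 – Miranda House',
    'Rank 7\xa0–\xa0St Xavier’s College, Kolkata',
    'Comments',
]

# \s matches Unicode whitespace, including the non-breaking space \xa0
RANK_PATTERN = re.compile(r'^Rank\s+(\d+)\s*–\s*(.+)$')

def parse_rank_headings(heading_list):
    """Return (rank, college_name) tuples for headings shaped like 'Rank N – Name'."""
    parsed = []
    for heading in heading_list:
        match = RANK_PATTERN.match(heading)
        if match:
            parsed.append((int(match.group(1)), match.group(2).strip()))
    return parsed

ranked = parse_rank_headings(headings)
print(ranked)
```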

**Finding contents of the table (table) tag**

In [None]:
#Contents only in Table Header
th_list = get_texts_from_tags(['th'], soup)
th_list

['Name', 'NIRF 2020 Rank', 'NIRF 2019 Rank']

In [None]:
# Select the table that has a header row: start from the first th tag and walk up to its tbody
# NOTE : Specific to this webpage only
table_body = soup.find('th').find_parent('tbody')

In [None]:
# Contents in Table Details of table_body
td_list = []
for h in table_body.find_all('td'):
  td_list.append(h.getText().strip().split('\n'))

td_list

[['Miranda House'],
 ['1'],
 ['1'],
 ['Lady Shri Ram College for Women'],
 ['2'],
 ['5'],
 ['Hindu College'],
 ['3'],
 ['2'],
 ['St Stephen’s College'],
 ['4'],
 ['4'],
 ['Presidency College'],
 ['5'],
 ['3'],
 ['Loyola College'],
 ['6'],
 ['6'],
 ['St Xavier’s College'],
 ['7'],
 ['10'],
 ['Ramakrishna Mission Vidyamandira'],
 ['7'],
 ['11'],
 ['Hansraj College'],
 ['9'],
 ['9'],
 ['PSGR Krishnammal College for Women'],
 ['10'],
 ['22']]
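
`td_list` is a flat list with one single-item list per cell, so the three columns of each row are not yet grouped together. A small sketch of one way to regroup them, assuming every row has exactly as many cells as there are table headers (three here); `group_cells` is an illustrative helper name.

```python
# A shortened copy of td_list from the output above
td_list = [['Miranda House'], ['1'], ['1'],
           ['Lady Shri Ram College for Women'], ['2'], ['5'],
           ['Hindu College'], ['3'], ['2']]

def group_cells(cells, n_columns):
    """Flatten single-item cells and group them into rows of n_columns."""
    flat = [cell[0] for cell in cells]
    return [flat[i:i + n_columns] for i in range(0, len(flat), n_columns)]

table_rows = group_cells(td_list, n_columns=3)
for row in table_rows:
    print(row)
```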

**To obtain structured content, we iterate over the table rows (tr) and collect the corresponding links from the href attribute of the anchor (a) tags**

In [None]:
tr_list = []
# To store list of college web links
url_list = []

for tr in table_body.find_all('tr'):
  for a in tr.find_all('a', href=True):
    url_list.append(a['href'])
    text_str = str(tr.getText().strip()+'\n'+a['href'])
    tr_list.append(text_str.split('\n'))

tr_list

[['Miranda House ',
  ' 1 ',
  ' 1',
  'https://www.shiksha.com/college/miranda-house-north-campus-delhi-3090'],
 ['Lady Shri Ram College for Women ',
  ' 2\xa0 ',
  ' 5',
  'https://www.shiksha.com/college/lady-shri-ram-college-for-women-lajpat-nagar-delhi-23895'],
 ['Hindu College ',
  ' 3 ',
  ' 2',
  'https://www.shiksha.com/college/the-hindu-college-guntur-21399'],
 ['St Stephen’s College ',
  ' 4 ',
  ' 4',
  'https://www.shiksha.com/college/st-stephen-s-college-north-campus-delhi-23933'],
 ['Presidency College ',
  ' 5 ',
  ' 3',
  'https://www.shiksha.com/college/presidency-college-chennai-chepauk-20896'],
 ['Loyola College ',
  ' 6 ',
  ' 6',
  'https://www.shiksha.com/college/loyola-college-nungambakkam-chennai-1108'],
 ['St Xavier’s College ',
  ' 7 ',
  ' 10',
  'https://www.shiksha.com/college/st-xavier-s-college-kolkata-park-street-23242'],
 ['Ramakrishna Mission Vidyamandira ',
  ' 7 ',
  ' 11',
  'https://www.shiksha.com/college/ramakrishna-mission-vidyamandira-howrah-5
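
The rows above still carry stray spaces and `\xa0` characters, and the loop adds one entry per link rather than one per row. A possible alternative is to read each cell separately and strip it on the way. The sketch below runs on a made-up table (`SAMPLE_TABLE`, with `example.com` links) that only mimics the structure of the page; `extract_rows` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table mimicking the structure of the scraped page
SAMPLE_TABLE = """
<table><tbody>
  <tr><th>Name</th><th>NIRF 2020 Rank</th><th>NIRF 2019 Rank</th></tr>
  <tr><td><a href="https://example.com/college-a">College A </a></td><td> 1 </td><td> 1</td></tr>
  <tr><td><a href="https://example.com/college-b">College B </a></td><td> 2&nbsp; </td><td> 5</td></tr>
</tbody></table>
"""

def extract_rows(table_body):
    """Return one [cell, ..., href] list per table row that contains <td> cells."""
    rows = []
    for tr in table_body.find_all('tr'):
        # get_text(strip=True) removes surrounding whitespace, including \xa0
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if not cells:
            continue  # header rows contain only <th> cells
        link = tr.find('a', href=True)
        cells.append(link['href'] if link else None)
        rows.append(cells)
    return rows

sample_body = BeautifulSoup(SAMPLE_TABLE, 'html.parser').find('tbody')
clean_rows = extract_rows(sample_body)
print(clean_rows)
```

A row without a link gets `None` in the last position instead of being skipped, which keeps the rows aligned with the table.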

**Finding description of all 10 colleges given in paragraph (p) tags**

In [None]:
# NOTE : Specific to this webpage only
college_info = []
for i in soup.find_all('h3'):
  temp_str=''
  for j in i.find_all_next(['p', 'h3']):
    if (j.name=='p'):
      if (j.find_all('strong')):
        college_info.append(temp_str)
        break
      temp_str = temp_str + j.getText()
    elif(j.name=='h3'):
      college_info.append(temp_str)
      temp_str=''
      break

# To remove all empty strings in college_info
college_info = list(filter(None, college_info))
college_info


['Holding on to the top spot again is Miranda House. The college is known to be a pioneer in Science education as it was one of the few educational institutes that offered BSc (Hons) Botany course to students in 1948. Located at DU’s North Campus, the college was founded by Maurice Gwyerand and it was designed by Walter Sykes George. . At present Miranda House has 18 Departments and students are offered courses in Social Sciences, Humanities and Basic Sciences.\xa0An all girl’s college, Miranda House was ranked to be the best college in India for the fourth time in a row in NIRF-MHRD India Rankings 2020.',
 'Established in 1956 by Sir Shri Ram in memory of his wife, Lady Shri Ram College for Women (LSR) started functioning from a school building in Daryaganj, Central Delhi. The college was set up with an aim to offer women higher education. LSR College is part of DU’s South Campus and is known to have the best Arts faculty. The college campus is spread over an area of 15 acres and offe
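
The same heading-to-paragraph walk can be written so that each description stays attached to its heading. This is a sketch on a made-up article fragment (`SAMPLE_ARTICLE`); it keeps the stopping rules used above (the next h3, or a paragraph containing a strong tag) but, unlike the loop above, joins paragraphs with a space. `collect_sections` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical article fragment, used only for illustration
SAMPLE_ARTICLE = """
<div>
  <h3>Rank 1 – College A</h3>
  <p>First sentence about A.</p>
  <p>Second sentence about A.</p>
  <h3>Rank 2 – College B</h3>
  <p>Only sentence about B.</p>
  <p><strong>Also read:</strong> unrelated links</p>
</div>
"""

def collect_sections(soup):
    """Map each <h3> heading to the text of the <p> tags that follow it."""
    sections = {}
    for heading in soup.find_all('h3'):
        parts = []
        for node in heading.find_all_next(['p', 'h3']):
            # Stop at the next heading or at a paragraph with bold text
            if node.name == 'h3' or node.find('strong'):
                break
            parts.append(node.get_text(strip=True))
        if parts:
            sections[heading.get_text(strip=True)] = ' '.join(parts)
    return sections

sections = collect_sections(BeautifulSoup(SAMPLE_ARTICLE, 'html.parser'))
for title, text in sections.items():
    print(title, '->', text)
```

Keying the result by heading avoids relying on the descriptions and the table rows being in the same order.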

In [None]:
# Adding college information to tr_list

# Pair each table row with its description instead of hard-coding the number of colleges
for row, info in zip(tr_list, college_info):
  row.append(info)
tr_list

[['Miranda House ',
  ' 1 ',
  ' 1',
  'https://www.shiksha.com/college/miranda-house-north-campus-delhi-3090',
  'Holding on to the top spot again is Miranda House. The college is known to be a pioneer in Science education as it was one of the few educational institutes that offered BSc (Hons) Botany course to students in 1948. Located at DU’s North Campus, the college was founded by Maurice Gwyerand and it was designed by Walter Sykes George. . At present Miranda House has 18 Departments and students are offered courses in Social Sciences, Humanities and Basic Sciences.\xa0An all girl’s college, Miranda House was ranked to be the best college in India for the fourth time in a row in NIRF-MHRD India Rankings 2020.'],
 ['Lady Shri Ram College for Women ',
  ' 2\xa0 ',
  ' 5',
  'https://www.shiksha.com/college/lady-shri-ram-college-for-women-lajpat-nagar-delhi-23895',
  'Established in 1956 by Sir Shri Ram in memory of his wife, Lady Shri Ram College for Women (LSR) started functioning

**Converting to a pandas DataFrame for a neat structure**

In [None]:
import pandas as pd
th_list.append('URL')
th_list.append('Information')
df_data = pd.DataFrame(tr_list, columns=th_list)
df_data

| | Name | NIRF 2020 Rank | NIRF 2019 Rank | URL | Information |
|---|---|---|---|---|---|
| 0 | Miranda House | 1 | 1 | https://www.shiksha.com/college/miranda-house-... | Holding on to the top spot again is Miranda Ho... |
| 1 | Lady Shri Ram College for Women | 2 | 5 | https://www.shiksha.com/college/lady-shri-ram-... | Established in 1956 by Sir Shri Ram in memory ... |
| 2 | Hindu College | 3 | 2 | https://www.shiksha.com/college/the-hindu-coll... | Established in 1899, Hindu College was initial... |
| 3 | St Stephen’s College | 4 | 4 | https://www.shiksha.com/college/st-stephen-s-c... | Founded in 1881 by the Cambridge Mission to De... |
| 4 | Presidency College | 5 | 3 | https://www.shiksha.com/college/presidency-col... | Considered to be the mother body of University... |
| 5 | Loyola College | 6 | 6 | https://www.shiksha.com/college/loyola-college... | Established in 1925 by the Society of Jesus (J... |
| 6 | St Xavier’s College | 7 | 10 | https://www.shiksha.com/college/st-xavier-s-co... | St Xavier’s College was founded in 1860 by a C... |
| 7 | Ramakrishna Mission Vidyamandira | 7 | 11 | https://www.shiksha.com/college/ramakrishna-mi... | Founded by the Ramakrishna Mission in 1941, Ra... |
| 8 | Hansraj College | 9 | 9 | https://www.shiksha.com/college/hansraj-colleg... | Founded by DAV College Managing Committee on J... |
| 9 | PSGR Krishnammal College for Women | 10 | 22 | https://www.shiksha.com/college/psgr-krishnamm... | Established in 1963, PSGR Krishnammal College ... |
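
The rank columns in this DataFrame hold strings (some with stray spaces and `\xa0`), so they cannot be sorted or subtracted as numbers yet. A minimal sketch of one way to clean and convert them, using two rows copied from `tr_list`; the `Rank Change` column is an illustrative addition, not part of the notebook.

```python
import pandas as pd

# Two rows copied from tr_list above, including the stray spaces and \xa0
sample_rows = [['Miranda House ', ' 1 ', ' 1'],
               ['Lady Shri Ram College for Women ', ' 2\xa0 ', ' 5']]
columns = ['Name', 'NIRF 2020 Rank', 'NIRF 2019 Rank']
df = pd.DataFrame(sample_rows, columns=columns)

# Replace non-breaking spaces, then strip surrounding whitespace in every column
for col in columns:
    df[col] = df[col].str.replace('\xa0', ' ', regex=False).str.strip()

# Convert the rank columns from strings to numbers
for col in ['NIRF 2020 Rank', 'NIRF 2019 Rank']:
    df[col] = pd.to_numeric(df[col])

# Positive values mean the college moved up between 2019 and 2020
df['Rank Change'] = df['NIRF 2019 Rank'] - df['NIRF 2020 Rank']
print(df)
```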


**Writing data to CSV**

In [None]:
# Alternative: write the dataframe to CSV directly
# df_data.to_csv('_top10collegesinfo.csv', index=False)

In [None]:
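import csv
# newline='' avoids blank lines between rows on Windows;
# utf-8 keeps characters such as ’ intact
with open('_top10collegesinfo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(th_list)
    writer.writerows(tr_list)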
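
A quick way to check a CSV export is to read it back and compare it with what was written. The sketch below does this with a temporary file and two sample rows (one of them containing a comma, which the `csv` module quotes automatically); the file name `sample.csv` is arbitrary.

```python
import csv
import os
import tempfile

header = ['Name', 'NIRF 2020 Rank', 'NIRF 2019 Rank']
sample_rows = [['St Stephen’s College', '4', '4'],
               ['St Xavier’s College, Kolkata', '7', '10']]

# Write to a temporary directory that is removed when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')

    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample_rows)

    # Read the file back as a list of rows (every value comes back as a string)
    with open(path, newline='', encoding='utf-8') as f:
        read_back = list(csv.reader(f))

print(read_back[0])
print(read_back[1:])
```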

**Recursive Scraping**

In [None]:
college_web_content=[]
for url in url_list:
  url_soup = make_soup(url)
  all_text_list = get_texts_from_tags(['p'], url_soup)
  all_text_str = ' '.join(all_text_list)
  college_web_content.append(all_text_str)
college_web_content[0]

"Ranked 2 for Science  by India Today 2021  +3 more Updated on Jun 12, 2021 Founded in 1948, Miranda House is one of the oldest and premier women's institutions under the University of Delhi. The college was founded by Lady Edwina Mountbatten to promote higher education for women. The college boasts of its teaching departments. Its Humanities and Social Sciences section comes with 13 departments which are considered best amongst the rest of the colleges of the University. Jun 24, 2021: Top 10 BSc, MSc colleges in India The college caters to over 3000 students and offers education in sciences and liberal arts. It has ranked as the best college across the nation by NIRF rankings in 2019. The college offers admission on the basis of merit decided by the cut-off lists issued by Delhi University. For the detailed adm The college caters to over 3000 students and offers education in sciences and liberal arts. It has ranked as the best college across the nation by NIRF rankings in 2019. The co

In [None]:
# Getting college name lists
college_name_list = []
for i in tr_list:
  college_name_list.append(i[0])
college_name_list

['Miranda House ',
 'Lady Shri Ram College for Women ',
 'Hindu College ',
 'St Stephen’s College ',
 'Presidency College ',
 'Loyola College ',
 'St Xavier’s College ',
 'Ramakrishna Mission Vidyamandira ',
 'Hansraj College ',
 'PSGR Krishnammal College for Women ']

In [None]:
cols = ['College Name', 'College Web Link', 'Web Page Content']

In [None]:
rows = []
# Combine name, link and page content for each college without hard-coding the count
for name, url, content in zip(college_name_list, url_list, college_web_content):
  rows.append([name, url, content])
rows

[['Miranda House ',
  'https://www.shiksha.com/college/miranda-house-north-campus-delhi-3090',
  "Ranked 2 for Science  by India Today 2021  +3 more Updated on Jun 12, 2021 Founded in 1948, Miranda House is one of the oldest and premier women's institutions under the University of Delhi. The college was founded by Lady Edwina Mountbatten to promote higher education for women. The college boasts of its teaching departments. Its Humanities and Social Sciences section comes with 13 departments which are considered best amongst the rest of the colleges of the University. Jun 24, 2021: Top 10 BSc, MSc colleges in India The college caters to over 3000 students and offers education in sciences and liberal arts. It has ranked as the best college across the nation by NIRF rankings in 2019. The college offers admission on the basis of merit decided by the cut-off lists issued by Delhi University. For the detailed adm The college caters to over 3000 students and offers education in sciences and l

**Converting to a pandas DataFrame for a neat structure**

In [None]:
df_data = pd.DataFrame(rows, columns=cols)
df_data

| | College Name | College Web Link | Web Page Content |
|---|---|---|---|
| 0 | Miranda House | https://www.shiksha.com/college/miranda-house-... | Ranked 2 for Science by India Today 2021 +3 ... |
| 1 | Lady Shri Ram College for Women | https://www.shiksha.com/college/lady-shri-ram-... | Ranked 3 for Commerce by India Today 2021 +4... |
| 2 | Hindu College | https://www.shiksha.com/college/the-hindu-coll... | Updated on Sep 5, 2014 Jun 30, 2021: Top 10 Co... |
| 3 | St Stephen’s College | https://www.shiksha.com/college/st-stephen-s-c... | Page not found. Sorry, the page you were looki... |
| 4 | Presidency College | https://www.shiksha.com/college/presidency-col... | Ranked 24 for Arts by The Week 2019 Updated o... |
| 5 | Loyola College | https://www.shiksha.com/college/loyola-college... | Ranked 3 for BCA by India Today 2021 +9 more... |
| 6 | St Xavier’s College | https://www.shiksha.com/college/st-xavier-s-co... | Ranked 6 for Commerce by The Week 2020 +3 mo... |
| 7 | Ramakrishna Mission Vidyamandira | https://www.shiksha.com/college/ramakrishna-mi... | Updated on Apr 12, 2019 Go to website... http:... |
| 8 | Hansraj College | https://www.shiksha.com/college/hansraj-colleg... | Ranked 4 for Commerce by India Today 2021 +5... |
| 9 | PSGR Krishnammal College for Women | https://www.shiksha.com/college/psgr-krishnamm... | Ranked 96 for BBA by India Today 2021 Updated... |
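
Row 3 above contains a 'Page not found' message instead of college content, which shows that a link taken from the table can fail. When following many links, it may help to record failures per URL and to pause between requests. The sketch below is one possible shape for such a loop: `scrape_all` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a function so that the example runs without network access. A real fetcher might wrap `make_soup` and `get_texts_from_tags`.

```python
import time

def scrape_all(urls, fetch_text, delay_seconds=0.0):
    """Call fetch_text(url) for each URL, recording None when a fetch fails."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch_text(url)
        except Exception as exc:  # broad on purpose: one bad page should not stop the loop
            print(f"Skipping {url}: {exc}")
            results[url] = None
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return results

# Stand-in fetcher for illustration only; it raises for one URL to simulate a failure
def fake_fetch(url):
    if url.endswith('/missing'):
        raise ValueError('page not found')
    return f'content of {url}'

pages = scrape_all(['https://example.com/a', 'https://example.com/missing'], fake_fetch)
failed_urls = [url for url, text in pages.items() if text is None]
print(failed_urls)
```

The failed URLs can then be retried or corrected by hand before the results are written to CSV.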


In [None]:
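# newline='' avoids blank lines between rows on Windows;
# utf-8 keeps characters such as ’ intact
with open('_CollegeDetails.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    writer.writerows(rows)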