<a href="https://colab.research.google.com/github/amasick/WebScrapper/blob/main/Schools_Details.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# CBSE Schools Scraper Documentation

## Introduction

This script scrapes school details from the CBSE (Central Board of Secondary Education) school listings at `www.cbseschool.org`. It targets schools located outside India and extracts the school name, email, address, PIN code, management, and a "Read More" link for each one.

## Requirements

The script uses the following Python libraries:

- `requests`: For making HTTP requests.
- `BeautifulSoup` (from the `bs4` package): For parsing HTML content.
- `pandas`: For handling and exporting data to Excel. Writing `.xlsx` files also requires the `openpyxl` package.
- `re`: For regular expressions. It is part of the Python standard library and needs no installation.

You can install the third-party dependencies with:

```bash
pip install requests beautifulsoup4 pandas openpyxl
```

## Script Overview

### Function: `scrape_school_details(catbox_div)`

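This function takes a BeautifulSoup `Tag` representing a `div` element with the class `catbox`, which contains details about one school. It reads the school name from the `h2` link and uses regular expressions on the first `p` element to extract the email, address, PIN code, and management. The "Read More" link comes from the anchor with class `link` inside that paragraph.

#### Parameters:
- `catbox_div` (BeautifulSoup `Tag`): The `div` element containing school details.

#### Returns:
- `dict`: A dictionary containing school details, or `None` if the details paragraph is missing. Fields that cannot be matched are set to `None`.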
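
As a rough illustration of the extraction step, the sketch below applies regular expressions of the kind the script uses to a made-up paragraph. The sample text and the `first_group` and `extract_fields` helper names are assumptions for illustration only; the wording on the live site may differ.

```python
import re

# Hypothetical paragraph text; the real page wording may differ.
SAMPLE = (
    "Example Public School can be contacted at info@example.org. "
    "Address of the school is: 12 Sample Street, Doha PIN Code: 00000. "
    "The school is being managed by Example Education Trust."
)

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")
ADDRESS_RE = re.compile(r"Address of the school is:(.*?)PIN Code:", re.DOTALL)
PIN_RE = re.compile(r"PIN Code:\s*([^<.]+)")
MANAGEMENT_RE = re.compile(r"The school is being managed by\s*([^<]+)\.")


def first_group(pattern, text, group=1):
    """Return the stripped match group, or None when the pattern is absent."""
    match = pattern.search(text)
    return match.group(group).strip() if match else None


def extract_fields(text):
    """Pull the four regex-based fields out of one details paragraph."""
    return {
        "Email": first_group(EMAIL_RE, text, group=0),
        "Address": first_group(ADDRESS_RE, text),
        "PIN Code": first_group(PIN_RE, text),
        "Management": first_group(MANAGEMENT_RE, text),
    }


fields = extract_fields(SAMPLE)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Returning `None` for a missing field, rather than raising, keeps one malformed listing from stopping the whole run.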

### Function: `scrape_all_schools(url)`

This function scrapes every school listed on a single page. It requests the given URL with a browser-like `User-Agent` header, finds all `div` elements with the class `catbox`, and calls `scrape_school_details` on each one. Pagination is handled by the main section, which calls this function once per page.

#### Parameters:
- `url` (str): The URL of one listing page, for example the base URL followed by a page number.

#### Returns:
- `list`: A list of dictionaries, where each dictionary represents the details of a school. Returns `None` if the request does not return status code 200.

### Function: `save_to_excel(data, output_file='Outside_India.xlsx')`

This function takes the scraped school data and saves it to an Excel file.

#### Parameters:
- `data` (list of dict): List of dictionaries containing school details.
- `output_file` (str, optional): The name of the output Excel file. Default is 'Outside_India.xlsx'.

### Main Section: `if __name__ == "__main__":`

The main section of the script sets the base URL, initializes an empty list (`all_schools_data`) to store scraped data, and iterates through pages to scrape school details. It then calls the `save_to_excel` function to export the data to an Excel file.
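
The page-by-page aggregation described above can be sketched offline as follows. `collect_pages` and `fake_fetch` are hypothetical names introduced here; `fake_fetch` stands in for `scrape_all_schools` so the example runs without network access, and its data is invented.

```python
def collect_pages(fetch_page, base_url, first_page, last_page):
    """Call fetch_page for each page URL and merge the usable results."""
    all_rows = []
    for page_number in range(first_page, last_page + 1):
        rows = fetch_page(f"{base_url}{page_number}")
        # A failed request (None) and an empty page ([]) are both skipped.
        if rows:
            # Drop individual entries that could not be parsed.
            all_rows.extend(row for row in rows if row is not None)
    return all_rows


def fake_fetch(url):
    """Stand-in for scrape_all_schools; returns invented data."""
    if url.endswith("/2"):
        return None  # simulate a failed request for page 2
    return [{"School Name": f"School from {url}"}, None]


rows = collect_pages(fake_fetch, "https://example.invalid/page/", 1, 3)
print(len(rows))
```

Filtering out `None` entries before export matters because the per-school parser can return `None` when a listing has no details paragraph.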

## Execution

To run the script, ensure you have the required dependencies installed and execute it using:

```bash
python script_name.py
```

Make sure to replace `script_name.py` with the actual name of your Python script.

## Customization

- **Base URL**: You can modify the `base_url` variable to target a different location or category of schools.

- **Number of Pages**: Adjust the loop range in the main section (`for page_number in range(1, 5):`) to match the actual number of pages. The upper bound is exclusive, so `range(1, 5)` covers pages 1 to 4.

- **Output File**: You can change the default output file name in the `save_to_excel` function if needed.
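
One possible way to keep the base URL and page count in a single place is a small helper that builds the page URLs up front. `build_page_urls` is a hypothetical helper, not part of the script; it assumes the page number is appended directly to the base URL, as in the main section.

```python
def build_page_urls(base_url, num_pages, first_page=1):
    """Return one URL per listing page, numbered from first_page."""
    # Normalise the base so it always ends with exactly one slash.
    base = base_url.rstrip("/") + "/"
    return [f"{base}{page}" for page in range(first_page, first_page + num_pages)]


urls = build_page_urls("https://www.cbseschool.org/location/outside-india/page/", 4)
for url in urls:
    print(url)
```

Passing the page count explicitly avoids the off-by-one confusion that comes with editing a `range` upper bound by hand.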

## Conclusion

This script scrapes CBSE school details page by page and exports them to Excel. The base URL, page range, and output file can be adjusted to meet specific requirements.

In [2]:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


def scrape_school_details(catbox_div):
    school_name = catbox_div.find('h2').find('a').text.strip()

    # Find the paragraph containing school details
    school_details_paragraph = catbox_div.find('p')

    # Check if the paragraph is found before accessing its content
    if school_details_paragraph:
        # Use regex to extract email
        email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b', school_details_paragraph.text)
        email = email_match.group(0).strip() if email_match else None


        # Use regex to extract address
        address_match = re.search(r'Address of the school is:(.*?)PIN Code:', school_details_paragraph.text)
        address = address_match.group(1).strip() if address_match else None
        # print(address)

        # Use regex to extract PIN code
        pin_code_match = re.search(r'PIN Code:\s*([^<.]+)', school_details_paragraph.text)
        pin_code = pin_code_match.group(1).strip() if pin_code_match else None

        # Use regex to extract management
        management_match = re.search(r'The school is being managed by\s*([^<]+)\.', school_details_paragraph.text)
        management = management_match.group(1).strip() if management_match else None

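        # The "Read More" anchor may be absent, so avoid subscripting None
        link_tag = school_details_paragraph.find('a', class_='link')
        read_more_link = link_tag.get('href') if link_tag else None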

        return {
            'School Name': school_name,
            'Email': email,
            'Address': address,
            'PIN Code': pin_code,
            'Management': management,
            'Read More Link': read_more_link
        }
    else:
        print("School details paragraph not found.")
        return None



def scrape_all_schools(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'}

    # Make the GET request with custom headers
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        catbox_divs = soup.find_all('div', class_='catbox')

        all_schools_data = []
        for catbox_div in catbox_divs:
            school_data = scrape_school_details(catbox_div)
            # Skip entries that could not be parsed (the function returned None)
            if school_data:
                all_schools_data.append(school_data)

        return all_schools_data

    else:
        print(f"Failed to fetch data. Status code: {response.status_code}")
        return None


def save_to_excel(data, output_file='Outside_India.xlsx'):
    df = pd.DataFrame(data)
    df.to_excel(output_file, index=False)
    print(f"Data saved to {output_file}")

if __name__ == "__main__":
    base_url = "https://www.cbseschool.org/location/outside-india/page/"
    all_schools_data = []

    for page_number in range(1, 5):  # Pages 1 to 4; raise the upper bound to cover more pages
        url = f"{base_url}{page_number}"
        page_schools_data = scrape_all_schools(url)

        if page_schools_data:
            all_schools_data.extend(page_schools_data)

    if all_schools_data:
        save_to_excel(all_schools_data)



Data saved to Outside_India.xlsx


# **Outside India Schools**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


def scrape_school_details(catbox_div):
    school_name = catbox_div.find('h2').find('a').text.strip()

    # Find the paragraph containing school details
    school_details_paragraph = catbox_div.find('p')

    # Check if the paragraph is found before accessing its content
    if school_details_paragraph:
        # Use regex to extract email
        email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b', school_details_paragraph.text)
        email = email_match.group(0).strip() if email_match else None


        # Use regex to extract address
        address_match = re.search(r'Address of the school is:(.*?)PIN Code:', school_details_paragraph.text)
        address = address_match.group(1).strip() if address_match else None
        # print(address)

        # Use regex to extract PIN code
        pin_code_match = re.search(r'PIN Code:\s*([^<.]+)', school_details_paragraph.text)
        pin_code = pin_code_match.group(1).strip() if pin_code_match else None

        # Use regex to extract management
        management_match = re.search(r'The school is being managed by\s*([^<]+)\.', school_details_paragraph.text)
        management = management_match.group(1).strip() if management_match else None

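        # The "Read More" anchor may be absent, so avoid subscripting None
        link_tag = school_details_paragraph.find('a', class_='link')
        read_more_link = link_tag.get('href') if link_tag else None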

        return {
            'School Name': school_name,
            'Email': email,
            'Address': address,
            'PIN Code': pin_code,
            'Management': management,
            'Read More Link': read_more_link
        }
    else:
        print("School details paragraph not found.")
        return None



def scrape_all_schools(url):
    # headers = {
    #     'User-Agent': 'Thunder Client (https://www.thunderclient.com)',
    #     'Accept': '/'
    # }
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'}


    # Make the GET request with custom headers
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        catbox_divs = soup.find_all('div', class_='catbox')

        all_schools_data = []
        for catbox_div in catbox_divs:
            school_data = scrape_school_details(catbox_div)
            # Skip entries that could not be parsed (the function returned None)
            if school_data:
                all_schools_data.append(school_data)

        return all_schools_data

    else:
        print(f"Failed to fetch data. Status code: {response.status_code}")
        return None

def save_to_excel(data, output_file='all_schools_details12.xlsx'):
    df = pd.DataFrame(data)
    df.to_excel(output_file, index=False)
    print(f"Data saved to {output_file}")

if __name__ == "__main__":
    # Paginated listing URL; the page number is appended in the loop below
    base_url = "https://www.cbseschool.org/location/outside-india/page/"
    all_schools_data = []

    for page_number in range(1, 25):  # Assuming you have 24 pages
        url = f"{base_url}{page_number}"
        page_schools_data = scrape_all_schools(url)

        if page_schools_data:
            all_schools_data.extend(page_schools_data)

    if all_schools_data:
        save_to_excel(all_schools_data)





In [None]:
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape details of a school from a given li element
def scrape_school_details(li_element):
    # Find the div containing school details
    school_details_div = li_element.find('div', class_='edu-school-detlist-container')

    # Check if school details are found
    if school_details_div:
        # Extracting the school name
        school_name = school_details_div.find('h2', class_='edu-school-det-heading').text.strip()

        # Initializing a dictionary to store school details
        details = {'School Name': school_name}

        # Extracting details using a loop through all label elements
        for label_element in school_details_div.find_all('label'):
            label_text = label_element.text.strip().rstrip(':')
            # Extracting corresponding detail text; the value div may be missing
            detail_div = label_element.find_next('div', class_='edu-school-det-text')
            detail_text = detail_div.text.strip() if detail_div else None
            details[label_text] = detail_text

        return details
    else:
        print("School details not found.")
        return None

# Function to scrape all schools from multiple pages
def scrape_all_schools(base_url, num_pages):
    # Setting headers to simulate a user-agent
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'}

    # Initializing a list to store all schools' data
    all_schools_data = []

    # Looping through a range of page numbers
    for page_number in range(1, num_pages + 1):
        # Constructing the URL for the current page
        url = f"{base_url}?page={page_number}"

        # Making a GET request to the current page
        response = requests.get(url, headers=headers)

        # Checking if the request was successful (status code 200)
        if response.status_code == 200:
            # Parsing the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Finding the parent element containing all the li elements
            parent_element = soup.find('div', class_='edu-school-list-wrap')

            # Checking if the parent element is found
            if parent_element:
                # Finding all li elements within the parent element
                li_elements = parent_element.find_all('li')

                # Iterating through each li element and extracting school data
                for li_element in li_elements:
                    school_data = scrape_school_details(li_element)
                    if school_data:
                        all_schools_data.append(school_data)

            else:
                print(f"Parent element not found on page {page_number}.")

        # Reporting any unsuccessful response (for example 404 for a missing page)
        else:
            print(f"Failed to fetch page {page_number}. Status code: {response.status_code}")

    return all_schools_data

# Function to save scraped school data to an Excel file
def save_to_excel(data, output_file='all_schools_details0.xlsx'):
    # Creating a DataFrame; columns are the union of keys across all records,
    # so labels missing from the first school are not dropped
    df = pd.DataFrame(data)

    # Saving the DataFrame to an Excel file without an index column
    df.to_excel(output_file, index=False)
    print(f"Data saved to {output_file}")

# Main execution section
if __name__ == "__main__":
    # Defining the base URL for scraping CBSE schools in foreign countries
    base_url = "https://www.careerindia.com/cbse-schools-in-foreign-schools-s11.html"

    # Specifying the number of pages to scrape (assuming 24 pages)
    num_pages = 24

    # scrape_all_schools already loops over pages 1..num_pages, so call it once;
    # calling it inside another page loop would re-scrape earlier pages repeatedly
    all_schools_data = scrape_all_schools(base_url, num_pages)

    # Checking if any school data was collected
    if all_schools_data:
        # Calling the save_to_excel function to export the data to an Excel file
        save_to_excel(all_schools_data)
    else:
        print("No school data found.")


Data saved to all_schools_details0.xlsx
