
# EM 615: Programming for Data Science
**Indian Institute of Technology Gandhinagar**

**Author**: [Chandrabhan Patel](https://www.linkedin.com/in/cpatel321/)  
**Email**: [chandrabhan.patel@iitgn.ac.in](mailto:chandrabhan.patel@iitgn.ac.in)


---

## Introduction to Web Scraping
Web scraping is the process of extracting data from websites. It lets you collect structured data from the web, which can be used for analysis, visualization, or other data science applications. In this tutorial, we'll cover the basics of web scraping using Python and the BeautifulSoup library.
We will also look at ethical considerations when scraping websites, along with an overview of APIs and how they can be used to extract data from websites.

URL to sample website used in this tutorial: [https://cpatel321.github.io/webscrapping-tutorial/index.html](https://cpatel321.github.io/webscrapping-tutorial/index.html)

### What is Web Scraping?
Web scraping involves the following steps:
1. Sending a request to a website.
2. Retrieving the content of the webpage.
3. Parsing the content to extract specific information.
4. Saving the extracted data for further use.
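
The four steps above can be sketched end to end with only the Python standard library. In this minimal sketch, the network request is replaced by a hypothetical `fetch_page` stub that returns a hard-coded HTML string. The `HeadingCollector` class, the sample HTML, and the placeholder URL are assumptions made for illustration; the rest of this tutorial uses `requests` and `BeautifulSoup` instead.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content, used in place of a real HTTP response.
SAMPLE_HTML = """
<html><body>
  <article><h2>Article 1: Random Facts</h2><p>First paragraph.</p></article>
  <article><h2>Article 2: More Random Facts</h2><p>Second paragraph.</p></article>
</body></html>
"""


def fetch_page(url):
    """Steps 1 and 2: send a request and retrieve the page (stubbed here)."""
    return SAMPLE_HTML


class HeadingCollector(HTMLParser):
    """Step 3: collect the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings[-1] += data


def save_headings(headings):
    """Step 4: write the headings as CSV (to a string buffer in this sketch)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["heading"])
    for heading in headings:
        writer.writerow([heading.strip()])
    return buffer.getvalue()


html = fetch_page("https://example.com/articles")  # placeholder URL
collector = HeadingCollector()
collector.feed(html)
csv_text = save_headings(collector.headings)
print(csv_text)
```

Keeping each step in its own function makes it easy to swap the stub for a real request later without touching the parsing or saving code.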

### Ethical Considerations
##### robots.txt
Before scraping, ensure the website permits it. Check the website's `robots.txt` file and adhere to its guidelines. 
More details about robots.txt can be found [here](https://www.robotstxt.org/robotstxt.html).

1. Ensure you have the right to scrape the website's content. Some websites have terms of service that prohibit web scraping.

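2. Some websites publish their content under a Creative Commons license, which permits reuse under stated conditions. Some variants (those with the NC clause) restrict reuse to non-commercial purposes, so check which license applies.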
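
The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical `robots.txt` inlined as a string so that it runs offline; the rules, the `my-scraper` user agent, and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Against a live site, you would instead call `set_url()` with the address of the site's `robots.txt`, followed by `read()`, before calling `can_fetch()`.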

## How do browsers work?

When you visit a website, your browser sends a request to the server hosting the website. The server then sends back the content of the webpage, which is rendered by the browser. The browser interprets the HTML, CSS, and JavaScript code to display the webpage as intended.
The protocol used for this communication is HTTP (HyperText Transfer Protocol) or HTTPS (HTTP Secure).
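
To make the request and response cycle concrete, the following sketch starts a throwaway HTTP server on the local machine and sends it a single GET request, much as a browser would. Everything here (the `DemoHandler` class, the `PAGE` content, the `demo_request` helper) is a hypothetical illustration built on the standard library, not part of the tutorial's scraping code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the local demo server.
PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers with a status line, headers, and a body.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def demo_request():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    try:
        # The client side: open a connection and send a GET request.
        connection = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode("utf-8"),
        )
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = demo_request()
print(status, content_type)
print(body)
```

The status code, headers, and body printed here are the same three pieces of a response that a scraper inspects after fetching a page.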
## How do search engines index websites?
Search engines use web crawlers to index websites. These crawlers follow links on webpages to discover new content and index it in the search engine's database. When you search for a query, the search engine retrieves relevant results from its database and displays them to you.
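
The link-following step of a crawler can be sketched as follows. The HTML snippet is modelled on the navigation menu of the demo site, and the `LinkCollector` class is a hypothetical helper written with the standard library's `html.parser`. A real crawler would also keep a queue of pages to visit and a set of pages already seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Navigation markup modelled on the demo site's menu.
NAV_HTML = """
<nav><ul>
  <li><a href="index.html">Home</a></li>
  <li><a href="table.html">Data Tables</a></li>
  <li><a href="text.html">Text Content</a></li>
  <li><a href="figures.html">Figures & Statistics</a></li>
</ul></nav>
"""


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


base_url = "https://cpatel321.github.io/webscrapping-tutorial/text.html"
link_collector = LinkCollector(base_url)
link_collector.feed(NAV_HTML)

for link in link_collector.links:
    print(link)
```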

A `sitemap.xml` file tells search engines about the structure of the website and lists the URLs that should be indexed.
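
A sitemap is plain XML, so its URLs can be read with the standard library. The sketch below parses a hypothetical two-entry sitemap inlined as a string; the `example.com` URLs and the `sitemap_urls` helper are assumptions for illustration, while the namespace is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/index.html</loc></url>
  <url><loc>https://example.com/table.html</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace.
NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the URL listed in every <loc> element of a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NAMESPACES)]


urls = sitemap_urls(SAMPLE_SITEMAP)
print(urls)
```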

### What are requests?
The `requests` library in Python allows you to send HTTP requests to websites and retrieve their content. You can use it to fetch webpages, download files, and interact with web APIs.

# Setting Up the Environment for Scraping
We'll use the following libraries in this tutorial:

- `requests`: To send HTTP requests to fetch web pages.
- `BeautifulSoup`: To parse and extract data from HTML content.

Run the following cell to install the required libraries.

In [None]:
! pip install requests beautifulsoup4

# Step 1: Fetching the Webpage

We'll begin by sending a GET request to the demo website and fetching its HTML content. The `requests` library is used for this purpose.

In [3]:
import requests

# URL of the demo website
url = "https://cpatel321.github.io/webscrapping-tutorial/text.html"

response = requests.get(url)

# Check the status code
if response.status_code == 200:
    print("Page fetched successfully!")
    print(response.text[:])  # Print the HTML; use a slice such as response.text[:500] to limit the output
else:
    print("Failed to fetch the page. Status code:", response.status_code)

Page fetched successfully!
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" href="style.css">
    <title>DataLand - Text Content</title>
</head>
<body>
    <header>
        <h1>Random Text Content</h1>
        <nav>
            <ul>
                <li><a href="index.html">Home</a></li>
                <li><a href="table.html">Data Tables</a></li>
                <li><a href="text.html">Text Content</a></li>
                <li><a href="figures.html">Figures & Statistics</a></li>
            </ul>
        </nav>
    </header>
    <section>
        <article>
            <h2>Article 1: Random Facts</h2>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
            <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
 

# Step 2: Parsing the Webpage
We'll use BeautifulSoup to parse the HTML content and extract meaningful data. In this case, we aim to retrieve the text content of all articles from the demo webpage.

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser') # parse the HTML content using BeautifulSoup

articles = soup.find_all('article') # takes a tag name and returns a list of all matching elements

# Extract and print the text from each article
for idx, article in enumerate(articles, start=1):
    heading = article.find('h2').get_text(strip=True) # strip=True removes leading and trailing whitespace
    content = article.find_all('p')
    print(f"Match ID {idx}: {heading}")
    for para in content:
        print(para.get_text(strip=True))
    print("\n")

Match ID 1: Article 1: Random Facts
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.


Match ID 2: Article 2: More Random Facts
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


Match ID 3: Article 3: Fun Trivia
Did you know that honey never spoils? Archaeologists have found pots of honey in ancient Egyptian tombs that are over 3,000 years old and still perfectly edible.
Trivia can be a fun way to break the ice and learn something new every day!


Match ID 4: Article 4: Tech Tidbits
The first computer virus, named "Creeper," was created in 1971 as an experimental program to test self-replication.
Technology has come a long way since then, with
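
The loop above assumes that every `<article>` contains an `<h2>`. Because `find` returns `None` when a tag is missing, a slightly more defensive version might look like the sketch below. The sample HTML and the `"(no heading)"` placeholder are assumptions for illustration, and the block requires `beautifulsoup4` to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no <h2>.
SAMPLE_HTML = """
<section>
  <article><h2>  With heading  </h2><p>Some text.</p></article>
  <article><p>No heading here.</p></article>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for article in soup.find_all("article"):
    h2 = article.find("h2")  # None when the article has no <h2>
    heading = h2.get_text(strip=True) if h2 is not None else "(no heading)"
    # select() takes a CSS selector and, like find_all(), returns a list.
    paragraphs = [p.get_text(strip=True) for p in article.select("p")]
    records.append((heading, paragraphs))

for heading, paragraphs in records:
    print(heading, paragraphs)
```

Checking for `None` before calling `get_text` avoids an `AttributeError` when a page does not follow the expected layout.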

# Step 3: Saving Data
Extracted data can be saved to a file for further use. In this example, we'll save the article data to a text file.

In [3]:
# saving articles to a text file
with open("articles.txt", "w", encoding="utf-8") as f:
    for idx, article in enumerate(articles, start=1):
        heading = article.find('h2').get_text(strip=True)
        content = article.find_all('p')
        f.write(f"Article {idx}: {heading}\n")
        for para in content:
            f.write(para.get_text(strip=True) + "\n")
        f.write("\n")
print("Articles saved to articles.txt")

Articles saved to articles.txt
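
A plain text file is easy to read but awkward to load back into a program. As an alternative, the sketch below stores the articles as JSON and reloads them. The `articles_data` list is hypothetical sample data standing in for the scraped articles, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import json
import os
import tempfile

# Hypothetical sample data in the shape a scraper might produce.
articles_data = [
    {"heading": "Article 1: Random Facts", "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"heading": "Article 3: Fun Trivia", "paragraphs": ["Did you know that honey never spoils?"]},
]

with tempfile.TemporaryDirectory() as folder:
    output_path = os.path.join(folder, "articles.json")

    # Save: ensure_ascii=False keeps non-ASCII characters readable in the file.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(articles_data, f, ensure_ascii=False, indent=2)

    # Load the file back to confirm nothing was lost.
    with open(output_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(len(reloaded), "articles reloaded")
```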


# Step 4: Scraping Tables
If the webpage contains tables, they can be extracted and saved as CSV files for easier analysis.

In [6]:
import csv # standard-library module for reading and writing CSV files

# Send a request to a different page of the demo site and extract the data from its tables
table_url = "https://cpatel321.github.io/webscrapping-tutorial/table.html"
soup_table = BeautifulSoup(requests.get(table_url).text, 'html.parser')

tables = soup_table.find_all('table') # table is the tag used by most of the tables we see on websites

if tables: # list is not empty
    table = tables[0] # pick the first table
    rows = table.find_all('tr') # each <tr> is one table row

    # Open a CSV file for writing
    with open("table_data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)

        # Write each row to the CSV file and echo it to the screen
        for row in rows:
            cells = row.find_all(['th', 'td']) # header and data cells
            values = [cell.get_text(strip=True) for cell in cells]
            writer.writerow(values)
            print(values)
    print("\nTable data saved to table_data.csv")
else:
    print("No tables found on the page.")

['ID', 'Name', 'Age', 'Country']
['1', 'John Doe', '30', 'USA']
['2', 'Jane Smith', '25', 'Canada']

Table data saved to table_data.csv
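
Every value scraped from HTML is a string, so numeric columns need converting before analysis. The sketch below reads the same rows back with `csv.DictReader` and converts `ID` and `Age` to integers. The CSV text is inlined from the output above so the example runs without the file, and the `load_people` helper is a hypothetical name.

```python
import csv
import io

# The rows printed above, inlined so the example does not need table_data.csv.
CSV_TEXT = "ID,Name,Age,Country\n1,John Doe,30,USA\n2,Jane Smith,25,Canada\n"


def load_people(csv_file):
    """Read rows as dictionaries and convert numeric columns to int."""
    people = []
    for row in csv.DictReader(csv_file):
        people.append({
            "id": int(row["ID"]),
            "name": row["Name"],
            "age": int(row["Age"]),
            "country": row["Country"],
        })
    return people


people = load_people(io.StringIO(CSV_TEXT))
average_age = sum(person["age"] for person in people) / len(people)
print(people)
print("Average age:", average_age)
```

To read the saved file instead, pass `open("table_data.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.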


In [None]:
from urllib.parse import urljoin # for image url joining

image_url = "https://cpatel321.github.io/webscrapping-tutorial/figures.html"


# Parse the HTML
soup_image = BeautifulSoup(requests.get(image_url).text, 'html.parser')

# Find all images and their associated text
images = soup_image.find_all('img')
text_data = soup_image.find_all(['p', 'ul'])

# Create a CSV file to store the scraped data
with open('images_and_text.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Image path", "Description"])  # Write the header

    # Loop through each image and corresponding text
    for image in images:
        img_url = image['src']
        # img_full_url = urljoin(image_url, img_url) # very useful function, saves a lot of manual work. used for joining the image url with the base url
        # # print("Image URL:", img_full_url)
        # img_response = requests.get(img_full_url, stream=True)
        # if img_response.status_code == 200:
        #         img_name = img_full_url.split("/")[-1]
        #         with open(img_name, 'wb') as img_file:
        #             for chunk in img_response.iter_content(1024):
        #                 img_file.write(chunk)

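        # Find a description: take the first <p> or <ul> that sits inside a <section>.
        # Note: this lookup does not depend on the image, so every row gets the same text;
        # image.find_next(['p', 'ul']) would instead pick the text that follows each image.
        associated_text = ""
        for text in text_data:
            if text.find_parent('section'):
                associated_text = text.get_text(strip=True)
                break  # Stop at the first relevant description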

        # Write the image URL and associated text to CSV
        writer.writerow([img_url, associated_text])

print("Data has been written to 'images_and_text.csv'.")

Data has been written to 'images_and_text.csv'.
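
The `src` attribute of an image is often a relative path, which is why the commented-out download code relies on `urljoin`. The sketch below shows how `urljoin` resolves several kinds of `src` values against the page URL. The `src` values themselves are hypothetical examples, not paths taken from the demo page.

```python
from urllib.parse import urljoin

page_url = "https://cpatel321.github.io/webscrapping-tutorial/figures.html"

# Hypothetical src values of the kinds commonly found in <img> tags.
examples = [
    "images/chart.png",                # relative to the page's folder
    "/logo.png",                       # relative to the site root
    "../shared/photo.jpg",             # one folder up
    "https://example.com/remote.png",  # already absolute
]

resolved = {src: urljoin(page_url, src) for src in examples}
for src, full_url in resolved.items():
    print(f"{src} -> {full_url}")
```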


In [7]:
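# Example with the IMDb Top 250 chart.
# The selectors below (td.titleColumn, span, strong) depend on the page's markup;
# if the markup differs, `movies` is empty and only the header row is written.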
imdb_url = "https://www.imdb.com/chart/top"

# Parse the HTML
soup_imdb = BeautifulSoup(requests.get(imdb_url).text, 'html.parser')

# Find the list of top-rated movies
movies = soup_imdb.select('td.titleColumn')

# Create a CSV file to store the scraped data

with open('imdb_top_rated_movies.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Rank", "Title", "Year", "Rating"])  # Write the header

    # Loop through each movie
    for movie in movies:
        title = movie.find('a').get_text()
        year = movie.find('span').get_text()[1:5]
        rank = movie.get_text(strip=True).split('.')[0]
        rating = movie.find_next('strong').get_text()
        writer.writerow([rank, title, year, rating])

print("Data has been written to 'imdb_top_rated_movies.csv'.")


Data has been written to 'imdb_top_rated_movies.csv'.


# Final Notes

This tutorial demonstrated the basics of web scraping using Python. Remember to respect the terms of use of websites and avoid overloading servers with too many requests in a short time.
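
One simple way to avoid overloading a server is to pause between requests. The sketch below is a minimal illustration under stated assumptions: `fetch_politely` and `fake_fetch` are hypothetical helpers, the URLs are placeholders, and the `fetch` and `sleep` functions are passed in so the example runs instantly and offline.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    return f"<html>content of {url}</html>"


pauses = []  # records each pause instead of actually sleeping
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fake_fetch,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), "pages fetched with pauses", pauses)
```

In real use, you could pass `lambda url: requests.get(url, timeout=10).text` as `fetch` and leave `sleep` at its default.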