# Tutorial: Basic Web Scraping with Beautiful Soup
## Scraping Books Data from books.toscrape.com

### Step 1: Install Required Packages

In [None]:
!pip install requests beautifulsoup4 pandas



### Step 2: Import Libraries

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

print("Libraries imported successfully!")

Libraries imported successfully!


### Step 3: Get the Webpage

In [None]:
# URL of the books website
url = "http://books.toscrape.com/"

# Get the webpage
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Successfully retrieved the webpage!")
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


### Step 4: Extract Book Information

In [None]:
# Find all book articles
books = soup.find_all('article', class_='product_pod')
print(f"Found {len(books)} books on the page")

# Create a list to store book data
books_data = []

# Extract information from each book
for book in books:
    # Get title (in the 'title' attribute of the link inside <h3>)
    title = book.h3.a['title']

    # Get price (in a <p> tag with class 'price_color')
    price = book.find('p', class_='price_color').text.strip()

    # Get availability (in a <p> tag with class 'availability')
    availability = book.find('p', class_='availability').text.strip()

    # Get rating (in the class attribute of <p> tag with class 'star-rating')
    rating = book.find('p', class_='star-rating')['class'][1]

    # Store the data
    book_info = {
        'Title': title,
        'Price': price,
        'Availability': availability,
        'Rating': rating
    }

    books_data.append(book_info)

# Create DataFrame
df = pd.DataFrame(books_data)

# Display first few rows
print("\nFirst 5 books:")
display(df.head())

Found 20 books on the page

First 5 books:


| | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock | Three |
| 1 | Tipping the Velvet | £53.74 | In stock | One |
| 2 | Soumission | £50.10 | In stock | One |
| 3 | Sharp Objects | £47.82 | In stock | Four |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock | Five |
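
The `Rating` column holds words such as `Three` because the page encodes the star rating as a CSS class name. For sorting or averaging, a number is usually more convenient. A minimal sketch, assuming the class names are exactly `One` through `Five` (only some of them appear in the output above); `RATING_WORDS` and `rating_to_int` are illustrative names, not part of the tutorial:

```python
# Map the star-rating class names to integers.
# Assumption: the site uses exactly the words One..Five.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Return the numeric rating, or None for an unrecognised word."""
    return RATING_WORDS.get(word)

ratings = ["Three", "One", "One", "Four", "Five"]  # first five books above
numeric = [rating_to_int(r) for r in ratings]
print(numeric)  # [3, 1, 1, 4, 5]
```

On the DataFrame itself, `df['Rating'].map(RATING_WORDS)` would apply the same mapping to the whole column.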


### Step 5: Clean the Data

In [None]:
# Clean price (remove '£' symbol and convert to float)
df['Price'] = df['Price'].str.replace('£', '').astype(float)

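# Clean availability (extract the number of copies when the text contains one;
# the listing page only says "In stock", so the result here is NaN)
df['Availability'] = df['Availability'].str.extract(r'(\d+)')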

# Display cleaned data
print("Cleaned data:")
display(df.head())

Cleaned data:


| | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | NaN | Three |
| 1 | Tipping the Velvet | 53.74 | NaN | One |
| 2 | Soumission | 50.1 | NaN | One |
| 3 | Sharp Objects | 47.82 | NaN | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | NaN | Five |
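
The `Availability` column ends up empty because the digit pattern finds nothing to capture in the text `In stock`. A minimal sketch of the same pattern with the standard library; the second input, which carries a count in parentheses, is a hypothetical string used only to show a successful match, and `extract_count` is an illustrative helper:

```python
import re

def extract_count(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else None

print(extract_count("In stock"))                 # None
print(extract_count("In stock (22 available)"))  # 22
```

If the column is not needed as a number, leaving the original `In stock` text in place avoids replacing useful data with missing values.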


### Step 6: Save to CSV File

In [None]:
# Save to CSV
df.to_csv('books_data.csv', index=False)
print("\nData saved to 'books_data.csv'")

# Verify the saved data
print("\nVerifying saved data:")
saved_df = pd.read_csv('books_data.csv')
display(saved_df.head())


Data saved to 'books_data.csv'

Verifying saved data:


| | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | NaN | Three |
| 1 | Tipping the Velvet | 53.74 | NaN | One |
| 2 | Soumission | 50.1 | NaN | One |
| 3 | Sharp Objects | 47.82 | NaN | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | NaN | Five |


# Beginner-Friendly Web Scraping Problems [10 points each]

## Problem 1: Scrape Book Titles and Prices
**Objective**: Extract a list of book titles and their corresponding prices from [Books to Scrape](http://books.toscrape.com).

### Steps:
1. Navigate to the homepage of the website.
2. Identify all book titles and prices listed on the page.
3. Save the data into a CSV file with two columns: `Title` and `Price`.

---

## Problem 2: Scrape Top 10 Quotes from [Quotes to Scrape](http://quotes.toscrape.com)
**Objective**: Extract the top 10 quotes, their authors, and the associated tags from [Quotes to Scrape](http://quotes.toscrape.com).

### Steps:
1. Go to the homepage of the website.
2. Extract the text of the first 10 quotes, their authors, and the tags associated with each quote.
3. Save the data in a CSV file with three columns: `Quote`, `Author`, and `Tags`.

---

## Problem 3: Scrape Weather Data from [timeanddate.com](https://www.timeanddate.com/weather/)
**Objective**: Extract the current weather conditions (temperature, weather condition, and humidity) for a given city.

### Steps:
1. Visit [https://www.timeanddate.com/weather/](https://www.timeanddate.com/weather/).
2. Search for the weather data for a city (e.g., New York).
3. Extract the current temperature, weather description, and humidity levels.
4. Save the data in a structured format (e.g., a JSON or CSV file).


In [None]:
## Problem 1
url = "http://books.toscrape.com/"
## Finding the response
response = requests.get(url)
## Checking whether the response was fetched properly and, if so, parsing the HTML with BeautifulSoup
if response.status_code==200:
  print("Successfully retrieved the webpage!")
  soup = BeautifulSoup(response.content, 'html.parser')
else:
  print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [None]:
# Find all book articles
books = soup.find_all('article', class_='product_pod')
print(f"Found {len(books)} books on the page")

# Create a list to store book data
books_data = []

# Extract information from each book
for book in books:
    # Get title (in the 'title' attribute of the link inside <h3>)
    title = book.h3.a['title']

    # Get price (in a <p> tag with class 'price_color')
    price = book.find('p', class_='price_color').text.strip()

    # Store the data
    book_info = {
        'Title': title,
        'Price': price,
    }

    books_data.append(book_info)

# Create DataFrame
df = pd.DataFrame(books_data)

# Display first few rows
print("\nFirst 5 books:")
display(df.head())

Found 20 books on the page

First 5 books:


| | Title | Price |
|---|---|---|
| 0 | A Light in the Attic | £51.77 |
| 1 | Tipping the Velvet | £53.74 |
| 2 | Soumission | £50.10 |
| 3 | Sharp Objects | £47.82 |
| 4 | Sapiens: A Brief History of Humankind | £54.23 |


In [None]:
## Cleaning the data
df['Price'] = df['Price'].str.replace('£', '').astype(float)

In [None]:
print("Cleaned Data:")
df.head()

Cleaned Data:


| | Title | Price |
|---|---|---|
| 0 | A Light in the Attic | 51.77 |
| 1 | Tipping the Velvet | 53.74 |
| 2 | Soumission | 50.1 |
| 3 | Sharp Objects | 47.82 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 |


In [None]:
# Save to CSV
df.to_csv('books_data_1.csv', index=False)
print("\nData saved to 'books_data_1.csv'")

# Verify the saved data
print("\nVerifying saved data:")
saved_df = pd.read_csv('books_data_1.csv')
display(saved_df.head())


Data saved to 'books_data_1.csv'

Verifying saved data:


| | Title | Price |
|---|---|---|
| 0 | A Light in the Attic | 51.77 |
| 1 | Tipping the Velvet | 53.74 |
| 2 | Soumission | 50.1 |
| 3 | Sharp Objects | 47.82 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 |


In [None]:
## Problem 2
url2 = "https://quotes.toscrape.com/"
## Finding the response
response = requests.get(url2)
## Checking the status code of the response
if response.status_code==200:
  print("Successfully retrieved the webpage!")
else:
  print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [None]:
## Checking the content type of the response to choose an appropriate way of parsing the data
contentType = response.headers.get("Content-Type")

In [None]:
contentType

'text/html; charset=utf-8'

In [None]:
## The Content-Type is text/html, not JSON, so a JSON parser cannot be used; the data has to be scraped from the HTML of the response instead

In [None]:
response.content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinki

In [None]:
## Parsing the HTML
soup = BeautifulSoup(response.content,"html.parser")

In [None]:
## Finding all the div elements that have the "quote" class
quotes= soup.find_all("div",class_="quote")

In [None]:
## Finding the first 10 such elements
req_quotes = quotes[:10]

In [None]:
req_quotes

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

In [None]:
## Lists for storing the quotes, authors, tags
quote_text_list = []
authors_list = []
tags_list = []

In [None]:
## Finding the quotes, authors and tags in the first 10 elements and storing them in lists
for quote in req_quotes:
  ## Getting the spans that have the "text" class
  spans = quote.find_all("span",class_="text")
  ## Getting the first span because it contains the quote
  quote_span = spans[0]
  ## Getting the quote as text
  quote_text = quote_span.get_text(strip=True)
  ## Appending the extracted quote into the list
  quote_text_list.append(quote_text)

  ## Using the same approach for the authors
  small = quote.find("small",class_="author")
  author = small.get_text(strip=True)
  authors_list.append(author)

  ## Using the same approach for the tags, but storing each quote's tags as a list inside the final list
  tags = quote.find_all("a",class_="tag")
  tag_list = []
  for tag in tags:
    tag_list.append(tag.get_text(strip=True))
  tags_list.append(tag_list)

In [None]:
## Constructing the data
data = {
    "Quote": quote_text_list,
    "Author":authors_list,
    "Tags":tags_list
}

In [None]:
## Converting the data into a DataFrame
df=pd.DataFrame(data)

In [None]:
df.head()

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |


In [None]:
## Storing the data in a CSV file
df.to_csv("quotes.csv",index=False)

In [None]:
## Reading the CSV back to check it
df_read=pd.read_csv("quotes.csv")

In [None]:
df_read.head()

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | ['change', 'deep-thoughts', 'thinking', 'world'] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | ['abilities', 'choices'] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | ['inspirational', 'life', 'live', 'miracle', '... |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | ['aliteracy', 'books', 'classic', 'humor'] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | ['be-yourself', 'inspirational'] |
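
After the round trip through CSV, the `Tags` column no longer holds lists: each cell comes back as a string such as `"['abilities', 'choices']"`, which is why the items now appear with quotes. A minimal sketch of two ways to handle this with the standard library; `raw` stands in for one cell read back from the file:

```python
import ast

raw = "['abilities', 'choices']"  # one Tags cell as read back from the CSV

# Option 1: parse the string back into a real list.
tags = ast.literal_eval(raw)
print(tags)        # ['abilities', 'choices']
print(type(tags))  # <class 'list'>

# Option 2: avoid the problem by saving a delimited string instead of a list.
joined = ",".join(tags)
print(joined)             # abilities,choices
print(joined.split(","))  # ['abilities', 'choices']
```

With pandas, Option 1 could be applied to the whole column as `df_read['Tags'].apply(ast.literal_eval)`.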


In [None]:
## The approach for Problem 3 is similar to Problem 2, so the comments are not repeated

In [None]:
## Problem 3
url3 = "https://www.timeanddate.com/weather/usa/new-york"
response = requests.get(url3)
if response.status_code==200:
  print("Successfully retrieved the webpage!")
else:
  print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [None]:
contentType = response.headers.get("Content-Type")

In [None]:
contentType

'text/html; charset=UTF-8'

In [None]:
## Since the content type is text/html, we will use web scraping to find the data

In [None]:
soup = BeautifulSoup(response.content,"html.parser")

In [None]:
soup

<!DOCTYPE html>
<!--
scripts and programs that download content transparent to the user are not allowed without permission
--><html lang="en"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>Weather for New York, New York, USA</title><meta content="Current weather in New York and forecast for today, tomorrow, and next 14 days" name="description"/><meta content="max-image-preview:large" name="robots"/><meta content="https://www.timeanddate.com/scripts/cityog.php?title=Weather%20in&amp;tint=0x007b7a&amp;city=New%20York&amp;state=New%20York&amp;country=USA&amp;image=new-york1" property="og:image"/><meta content="1366" property="og:image:width"/><meta content="738" property="og:image:height"/><meta content="website" property="og:type"/><style>
@font-face{font-family:iconfont;src:url("/common/fonts/iconfont.woff2?v8") format("woff2"),url("/common/fonts/iconfont.woff?v8") format("woff"),url("/common/fonts/iconfont.ttf?v8") format("truetype"),url("/common/fonts

In [None]:
divs = soup.find_all("div",class_="h2")

In [None]:
divs

[<div class="h2">0 °C</div>]

In [None]:
tempDiv=divs[0]

In [None]:
tempDiv

<div class="h2">0 °C</div>

In [None]:
temp = tempDiv.get_text(strip=True)

In [None]:
temp

'0\xa0°C'

In [None]:
print(temp)

0 °C
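
The `'\xa0'` in the raw string is a non-breaking space, which prints like an ordinary space. To work with the temperature as a number, the value and the unit can be separated. A minimal sketch, assuming the text always has the form number, optional whitespace, unit; `parse_temperature` is an illustrative helper, not part of the assignment:

```python
import re

def parse_temperature(text):
    """Split a temperature string such as '-12 °C' into (value, unit)."""
    # \s also matches the non-breaking space (U+00A0) for str patterns.
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*(°[CF])", text.strip())
    if match is None:
        raise ValueError(f"Unrecognised temperature: {text!r}")
    return float(match.group(1)), match.group(2)

print(parse_temperature("0\xa0°C"))  # (0.0, '°C')
print(parse_temperature("-12 °C"))   # (-12.0, '°C')
```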


In [None]:
ps = soup.find_all("p")

In [None]:
ps

[<p>Clear.</p>,
 <p>Feels Like: -3 °C<br/><span title="High and low forecasted temperature today">Forecast: 4 / -1 °C</span><br/>Wind: 7 km/h <span class="comp sa8" title="Wind blowing from 280° West to East">↑</span> from West</p>,
 <p class="lk-block"><a class="read-more" href="/weather/usa/new-york/hourly">See more hour-by-hour weather</a></p>,
 <p class="lk-block"><a class="read-more" href="/weather/usa/new-york/ext">14 day forecast, day-by-day</a><a class="fr read-more mgr15" href="/weather/usa/new-york/hourly">Hour-by-hour forecast for next week</a></p>,
 <p>Clear. 6 / -1 °C<br/>Humidity: 44%. Wind: 11 km/h <span class="comp sa8" title="Wind blowing from 270° West to East">↑</span> from West</p>,
 <p class="tr lk-block pdr25"><a class="read-more" href="/weather/usa/new-york/historic">More weather last week</a></p>,
 <p>Passing clouds.</p>,
 <p>Clear.</p>,
 <p>Passing clouds.</p>,
 <p class="tr lk-block clear pdr25"><a class="read-more" href="/weather/usa">More weather in USA</a><

In [None]:
descriptionP = ps[0]

In [None]:
descriptionP

<p>Clear.</p>

In [None]:
description=descriptionP.get_text(strip=True)

In [None]:
description

'Clear.'

In [None]:
tds = soup.find_all("td")

In [None]:
tds

[<td>New York City - Central Park</td>,
 <td id="wtct">27 Jan 2025, 07:24:00</td>,
 <td>27 Jan 2025, 06:51</td>,
 <td>16 km</td>,
 <td>1017 mbar</td>,
 <td>41%</td>,
 <td>-12 °C</td>,
 <td>Now</td>,
 <td>08:00</td>,
 <td>09:00</td>,
 <td>10:00</td>,
 <td>11:00</td>,
 <td>12:00</td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-1.svg" title="Clear." width="80"/></td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-2.svg" title="Mostly sunny." width="80"/></td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-1.svg" title="Sunny." width="80"/></td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-1.svg" title="Sunny." width="80"/></td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-1.svg" title="Sunny." width="80"/></td>,
 <td><img class="mtt" height="80" src="//c.tadst.com/gfx/w/svg/wt-1.svg" title="Sunny." width="80"/></td>,
 <td>0 °C</td>,
 <td>-1 °C</td>,
 <td>0 °C</td>,
 <td>1 °C</td>,
 <

In [None]:
humidityTd = tds[5]

In [None]:
humidityTd

<td>41%</td>

In [None]:
humidity = humidityTd.get_text(strip=True)

In [None]:
humidity

'41%'
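
Picking the humidity with `tds[5]` depends on the exact order of the table cells, so it breaks as soon as the page adds or removes a row. A more robust option is to locate the cell by its label. A minimal sketch on a hypothetical HTML fragment: it assumes each value sits in a `<td>` next to a `<th>` that holds the label, which should be checked against the real page before use.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: assumes a <th> label beside each <td> value.
html = """
<table>
  <tr><th>Visibility: </th><td>16 km</td></tr>
  <tr><th>Pressure: </th><td>1017 mbar</td></tr>
  <tr><th>Humidity: </th><td>41%</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Find the header cell by its text, then read the cell beside it.
label = soup_demo.find("th", string=re.compile("Humidity"))
humidity_demo = label.find_next_sibling("td").get_text(strip=True)
print(humidity_demo)  # 41%
```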

In [None]:
data = {
    "Temp":temp,
    "Description":description,
    "Humidity":humidity
}

In [None]:
df = pd.DataFrame(data,index=[0])

In [None]:
df

| | Temp | Description | Humidity |
|---|---|---|---|
| 0 | 0 °C | Clear. | 41% |


In [None]:
df.to_csv("weather.csv",index=False)

In [None]:
dfRead = pd.read_csv("weather.csv")

In [None]:
dfRead

| | Temp | Description | Humidity |
|---|---|---|---|
| 0 | 0 °C | Clear. | 41% |
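
The problem statement also allows JSON as the output format, which suits a single record like this one. A minimal sketch with the standard library, using the values from the run above; the file name `weather.json` is a hypothetical choice, and the temporary directory only keeps the example from leaving files behind:

```python
import json
import os
import tempfile

# Values taken from the run above.
weather = {"Temp": "0\xa0°C", "Description": "Clear.", "Humidity": "41%"}

# Written to a temporary directory here; in the notebook a plain
# file name such as "weather.json" would work as well.
path = os.path.join(tempfile.mkdtemp(), "weather.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps the degree sign readable in the file.
    json.dump(weather, f, ensure_ascii=False, indent=2)

# Read the file back to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == weather)  # True
```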


**Pandas Assignment [10 points each]**

1. Create a DataFrame `df` from the dictionary `data` with custom index labels, and display a summary of basic information about the DataFrame and its data.

In [None]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Score": [85, 90, 95, 100]
}

df = pd.DataFrame(data, index=["A", "B", "C", "D"])

# Displaying a summary of the DataFrame
print("Basic Information:")
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()

print("\nSummary Statistics:")
print(df.describe())


Basic Information:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, A to D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
 3   Score   4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 160.0+ bytes

Summary Statistics:
             Age       Score
count   4.000000    4.000000
mean   32.500000   92.500000
std     6.454972    6.454972
min    25.000000   85.000000
25%    28.750000   88.750000
50%    32.500000   92.500000
75%    36.250000   96.250000
max    40.000000  100.000000
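
`describe()` reports the sample standard deviation, which divides by n - 1 rather than n. `Age` and `Score` show the same `std` because both columns are evenly spaced in steps of 5. A short check of the `Age` figures with the standard library:

```python
import statistics

ages = [25, 30, 35, 40]

mean = statistics.mean(ages)
# Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))
std = statistics.stdev(ages)

print(mean)           # 32.5
print(round(std, 6))  # 6.454972
```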


2. Return the first 5 rows of the DataFrame df.

In [None]:
df.head()

| | Name | Age | City | Score |
|---|---|---|---|---|
| A | Alice | 25 | New York | 85 |
| B | Bob | 30 | Los Angeles | 90 |
| C | Charlie | 35 | Chicago | 95 |
| D | David | 40 | Houston | 100 |


3. Explain how to create a pandas DataFrame from a Python list.

In [None]:
## Single List
data = [10, 20, 30, 40]

df = pd.DataFrame(data, columns=["Values"])

print(df)


   Values
0      10
1      20
2      30
3      40


In [None]:
# List of lists
data = [
    [1, "Alice", 85],
    [2, "Bob", 90],
    [3, "Charlie", 95],
    [4, "David", 100]
]

df = pd.DataFrame(data, columns=["ID", "Name", "Score"])

print(df)


   ID     Name  Score
0   1    Alice     85
1   2      Bob     90
2   3  Charlie     95
3   4    David    100


4. Show how to rename index labels using the `rename()` method.

In [None]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"])

print("Original DataFrame:")
print(df)

# Renaming index labels
renamed_df = df.rename(index={"A": "Alpha", "B": "Beta", "C": "Gamma", "D": "Delta"})

print("\nDataFrame with Renamed Index:")
print(renamed_df)


Original DataFrame:
      Name  Age
A    Alice   25
B      Bob   30
C  Charlie   35
D    David   40

DataFrame with Renamed Index:
          Name  Age
Alpha    Alice   25
Beta       Bob   30
Gamma  Charlie   35
Delta    David   40
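
`rename()` returns a new DataFrame and leaves the original untouched, and it only changes the labels that appear in the mapping. The same method renames columns through its `columns` argument. A minimal sketch on a smaller frame; the new labels are arbitrary choices for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]}, index=["A", "B"])

# Only "A" is in the index mapping, so "B" keeps its label.
renamed = df_demo.rename(index={"A": "Alpha"}, columns={"Age": "Years"})

print(list(renamed.index))    # ['Alpha', 'B']
print(list(renamed.columns))  # ['Name', 'Years']
print(list(df_demo.index))    # ['A', 'B']  (original is unchanged)
```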


5. You have a 2D NumPy array that you have converted into a pandas DataFrame. You want to assign specific index values to the rows of this DataFrame. If you pass a list of index values to the DataFrame, how does it affect the DataFrame, and how would you apply these index values?

In [None]:
import numpy as np

# Create a 2D NumPy array
data = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# Assign specific index values during DataFrame creation
df = pd.DataFrame(data, columns=["A", "B", "C"], index=["Row1", "Row2", "Row3"])

print("DataFrame with Assigned Index:")
print(df)


DataFrame with Assigned Index:
       A   B   C
Row1  10  20  30
Row2  40  50  60
Row3  70  80  90


In [None]:
# Create the DataFrame without assigning an index
df = pd.DataFrame(data, columns=["A", "B", "C"])

# Modify the index after creation
df.index = ["Row1", "Row2", "Row3"]

print("\nModified Index:")
print(df)



Modified Index:
       A   B   C
Row1  10  20  30
Row2  40  50  60
Row3  70  80  90
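
Both approaches give the same result: the labels replace the default 0, 1, 2 index, so rows can be selected by label with `.loc`, while positional access through `.iloc` keeps working. The list must contain exactly one label per row. A minimal sketch of both points; `frame` and `mismatch_error` are illustrative names:

```python
import numpy as np
import pandas as pd

arr = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
frame = pd.DataFrame(arr, columns=["A", "B", "C"], index=["Row1", "Row2", "Row3"])

# Labels select rows via .loc; positions still work via .iloc.
print(frame.loc["Row2"].tolist())  # [40, 50, 60]
print(frame.iloc[1].tolist())      # [40, 50, 60]

# Assigning two labels to three rows is rejected.
try:
    frame.index = ["Row1", "Row2"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)  # ValueError
```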


6. You have a dictionary of data that you want to store as a pandas Series. After creating the Series and storing it in the df variable, you print it and observe that the data is represented in a one-dimensional linear format. Explain how to create this Series from the dictionary and describe the output you would expect when printing the Series.

In [None]:
# Dictionary of data
data = {"Alice": 85, "Bob": 90, "Charlie": 95, "David": 100}

# Create the Series
df = pd.Series(data)

# Print the Series
print(df)


Alice       85
Bob         90
Charlie     95
David      100
dtype: int64
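
In the output above, the dictionary keys have become the index labels and the dictionary values have become the data, in insertion order. A minimal sketch of what that means in practice; `scores` is an illustrative name for the same Series:

```python
import pandas as pd

scores = pd.Series({"Alice": 85, "Bob": 90, "Charlie": 95, "David": 100})

# Keys become the index; values are looked up by label.
print(list(scores.index))  # ['Alice', 'Bob', 'Charlie', 'David']
print(scores["Bob"])       # 90

# Converting back recovers the original dictionary.
print(scores.to_dict() == {"Alice": 85, "Bob": 90, "Charlie": 95, "David": 100})  # True
```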


7. You create a dictionary and store it as a DataFrame in the df variable. After printing, the data appears as 2-dimensional rows and columns. How would you create this DataFrame from the dictionary, and what does the output look like?

In [None]:
# Dictionary of data
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Score": [85, 90, 95, 100]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)


      Name  Age         City  Score
0    Alice   25     New York     85
1      Bob   30  Los Angeles     90
2  Charlie   35      Chicago     95
3    David   40      Houston    100
