# 🌐 Web Scraping
### 🎯 CarWale Data

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h4 style="color: white; margin: 0;">📦 Required Libraries</h4>
</div>

```python
import pandas as pd           # Data manipulation & analysis
import requests               # HTTP requests
from bs4 import BeautifulSoup # Web scraping powerhouse 🚀
import numpy as np            # Numerical computing
```

### 📚 Library Overview

| Library | Purpose | Key Features |
|---------|---------|--------------|
| **pandas** | Data Analysis | DataFrames, CSV handling, data manipulation |
| **requests** | HTTP Requests | GET/POST requests, API calls |
| **BeautifulSoup** | Web Scraping | HTML parsing, element extraction |
| **numpy** | Numerical Computing | Arrays, mathematical operations |

---


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

In [8]:
requests.get('https://www.carwale.com/new/best-cars-under-30-lakh').text

'\n\n<!DOCTYPE html><html lang="en" itemscope itemtype="http://schema.org/WebPage"\n\tprefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#"><head><meta charset="utf-8" /><meta name="theme-color" content="#0e3a50"><meta name="robots" content="max-snippet:-1, max-image-preview:large" />\n\t<title data-react-helmet="true" itemprop="name">Best Cars Under 30 Lakh in India| Top Cars Below 30 Lakh - CarWale</title>\n\t\n<link href="https://imgd.aeplcdn.com" rel="preconnect dns-prefetch" crossorigin><link href="https://stc.aeplcdn.com" rel="preconnect dns-prefetch" crossorigin><link href="https://www.googletagmanager.com" rel="preconnect dns-prefetch" crossorigin><link href="https://www.googletagservices.com" rel="preconnect dns-prefetch" crossorigin><link href="https://bid.g.doubleclick.net" rel="preconnect dns-prefetch" crossorigin><link href="https://pagead2.googlesyndication.com" rel="dns-prefetch" crossorigin><link href="https://googleads.g.doubleclick.net" rel="dns-prefetch" crossorigi

### 🚫 Understanding 403 Access Denied Error
```python
requests.get('https://www.carwale.com/new/best-cars-under-30-lakh').text
```

---

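### ❌ Why Do We Get 403 Error?

**Simple Reasons:**

- 🤖 **Website thinks you're a bot** - By default, `requests` announces itself as `python-requests/<version>`, not as a browser
- 🚨 **Missing identification** - The request carries no browser-like `User-Agent` header
- 🛡️ **Security protection** - Websites block automated scripts to prevent abuse
- 👤 **Looks suspicious** - A bare Python request sends far fewer headers than a real browser does
- 🔒 **Anti-scraping measures** - Sites want to control who accesses their data

Note that `.text` does not raise an error on a 403: it returns the body of the error page, so check `response.status_code` to see what the server actually answered. Whether a site blocks a bare request varies; in the cell above, the request without headers did return the page.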
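To see what identifies a script to the server, the sketch below inspects the headers `requests` would send, without contacting the site. It builds the request through a `Session` but never sends it; the exact version in the printed string depends on the installed `requests`.

```python
import requests

# The URL from the example above; nothing is sent over the network here.
url = 'https://www.carwale.com/new/best-cars-under-30-lakh'

# What requests puts in the User-Agent header when you pass no headers.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/2.x

# Build (but do not send) a request through a Session to see the final headers.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', url))
print(prepared.headers['User-Agent'])
```

Both lines print the same `python-requests/...` string, which is what a server can use to tell the script apart from a browser.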

---

### 🎭 The Solution: Add Headers

**Make your request look human!**
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

requests.get('https://www.carwale.com/new/best-cars-under-30-lakh', headers=headers).text
```

---

### 💡 Key Takeaway

**Without headers** = 🤖 "I'm a bot!"  
**With headers** = 👤 "I'm a regular Chrome browser user!"
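Because `.text` returns whatever body the server sent, a blocked request can go unnoticed. One possible way to make failures visible is a small helper that checks the status before returning the HTML. The name `fetch_html`, the `session` parameter, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, or raise if the server answered with an error status."""
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx, e.g. 403
    return response.text
```

With this helper, `webpage = fetch_html('https://www.carwale.com/new/best-cars-under-30-lakh')` would either return the HTML or stop with an explicit error.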

---

In [12]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

webpage = requests.get('https://www.carwale.com/new/best-cars-under-30-lakh', headers=headers).text
webpage

'\n\n<!DOCTYPE html><html lang="en" itemscope itemtype="http://schema.org/WebPage"\n\tprefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#"><head><meta charset="utf-8" /><meta name="theme-color" content="#0e3a50"><meta name="robots" content="max-snippet:-1, max-image-preview:large" />\n\t<title data-react-helmet="true" itemprop="name">Best Cars Under 30 Lakh in India| Top Cars Below 30 Lakh - CarWale</title>\n\t\n<link href="https://imgd.aeplcdn.com" rel="preconnect dns-prefetch" crossorigin><link href="https://stc.aeplcdn.com" rel="preconnect dns-prefetch" crossorigin><link href="https://www.googletagmanager.com" rel="preconnect dns-prefetch" crossorigin><link href="https://www.googletagservices.com" rel="preconnect dns-prefetch" crossorigin><link href="https://bid.g.doubleclick.net" rel="preconnect dns-prefetch" crossorigin><link href="https://pagead2.googlesyndication.com" rel="dns-prefetch" crossorigin><link href="https://googleads.g.doubleclick.net" rel="dns-prefetch" crossorigi

### 🍲 Creating BeautifulSoup Object
```python
soup = BeautifulSoup(webpage, 'lxml')
soup
```

---

### 🤔 What is Soup?

**Simple Explanation:**

- 📄 **Soup = Organized HTML** - Takes messy HTML text and structures it
- 🔍 **Searchable Document** - Makes HTML easy to search and navigate
- 🎯 **Data Extractor** - Helps you find specific elements (tags, classes, IDs)
- 🧩 **Parser Object** - Breaks down HTML into understandable parts
- 🛠️ **Scraping Tool** - Your main tool for extracting data from web pages

---

### 🔧 Breaking Down the Code

| Component | What It Does |
|-----------|--------------|
| `BeautifulSoup()` | Class constructor that creates the soup object |
| `webpage` | Raw HTML text from the website |
| `'lxml'` | Parser that reads and structures the HTML (requires the `lxml` package to be installed) |
| `soup` | Final object you can search and extract data from |

---

### 💡 Think of Soup Like This:

**Raw HTML (webpage)** = 📦 Messy box of ingredients  
**BeautifulSoup (soup)** = 🍲 Organized, ready-to-use meal  

Now you can easily "pick out" the data you need! 🎯
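As a small illustration of "picking out" data, the sketch below parses a tiny hand-written page instead of the downloaded `webpage`. It uses Python's built-in `'html.parser'` rather than `'lxml'` so it runs without an extra install; the HTML and the class name `note` are made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for the downloaded `webpage` string.
html = ('<html><head><title>Page</title></head>'
        '<body><h1>Hello</h1><p class="note">Text</p></body></html>')

# 'html.parser' ships with Python; the notebook uses 'lxml', which needs a separate install.
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)                       # text of the <title> tag
print(soup.h1.name, soup.h1.get_text())        # tag name and its text
print(soup.find('p', class_='note')['class'])  # attributes are read like dict keys
```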

---

In [41]:
soup = BeautifulSoup(webpage, 'lxml')
print(soup)

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#"><head><meta charset="utf-8"/><meta content="#0e3a50" name="theme-color"/><meta content="max-snippet:-1, max-image-preview:large" name="robots"/>
<title data-react-helmet="true" itemprop="name">Best Cars Under 30 Lakh in India| Top Cars Below 30 Lakh - CarWale</title>
<link crossorigin="" href="https://imgd.aeplcdn.com" rel="preconnect dns-prefetch"/><link crossorigin="" href="https://stc.aeplcdn.com" rel="preconnect dns-prefetch"/><link crossorigin="" href="https://www.googletagmanager.com" rel="preconnect dns-prefetch"/><link crossorigin="" href="https://www.googletagservices.com" rel="preconnect dns-prefetch"/><link crossorigin="" href="https://bid.g.doubleclick.net" rel="preconnect dns-prefetch"/><link crossorigin="" href="https://pagead2.googlesyndication.com" rel="dns-prefetch"/><link crossorigin="" href="https://googleads.g.doubleclick.net" rel

### 🎨 Pretty Printing HTML
```python
soup.prettify()
```

---

### 🤔 What Does `prettify()` Do?

**Simple Explanation:**

- 📐 **Formats HTML neatly** - Adds proper indentation and line breaks
- 👀 **Makes it readable** - Transforms messy HTML into organized structure
- 🔍 **Easy to understand** - Shows parent-child relationships clearly
- 📝 **Debugging tool** - Helps you see the HTML structure at a glance
- ✨ **Beautifies code** - Converts single-line HTML into multi-line formatted text

`prettify()` returns a string. When it is the last expression in a notebook cell, the cell shows the raw string with `\n` escapes; wrap it in `print()` to see the actual indentation.

---

### 📊 Visual Comparison

**Before `prettify()`:**
```html
<html><head><title>Page</title></head><body><h1>Hello</h1><p>Text</p></body></html>
```

**After `prettify()`:**
```html
<html>
 <head>
  <title>
   Page
  </title>
 </head>
 <body>
  <h1>
   Hello
  </h1>
  <p>
   Text
  </p>
 </body>
</html>
```

---

### 💡 When to Use It?

- ✅ When you want to **see the HTML structure**
- ✅ For **debugging** your scraping code
- ✅ To **understand page layout** before extracting data
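A minimal sketch of using it for inspection, run on the same small page as in the visual comparison and parsed with the built-in `'html.parser'` (an assumption made so the example needs no extra install):

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Page</title></head><body><h1>Hello</h1><p>Text</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

pretty = soup.prettify()      # returns a new str; soup itself is unchanged

print(type(pretty).__name__)  # str
print(pretty)                 # print() renders the line breaks and indentation
```

For a full page the output is long, so printing only a slice such as `pretty[:500]` is often enough to get oriented.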

---

In [14]:
soup.prettify()

'<!DOCTYPE html>\n<html itemscope="" itemtype="http://schema.org/WebPage" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="#0e3a50" name="theme-color"/>\n  <meta content="max-snippet:-1, max-image-preview:large" name="robots"/>\n  <title data-react-helmet="true" itemprop="name">\n   Best Cars Under 30 Lakh in India| Top Cars Below 30 Lakh - CarWale\n  </title>\n  <link crossorigin="" href="https://imgd.aeplcdn.com" rel="preconnect dns-prefetch"/>\n  <link crossorigin="" href="https://stc.aeplcdn.com" rel="preconnect dns-prefetch"/>\n  <link crossorigin="" href="https://www.googletagmanager.com" rel="preconnect dns-prefetch"/>\n  <link crossorigin="" href="https://www.googletagservices.com" rel="preconnect dns-prefetch"/>\n  <link crossorigin="" href="https://bid.g.doubleclick.net" rel="preconnect dns-prefetch"/>\n  <link crossorigin="" href="https://pagead2.googlesyndication.com" rel="dns-prefetch"/>\n  <link cross

### 🎯 Extracting Car Names from H3 Tags
```python
soup.find_all('h3')
soup.find_all('h3')[0]
for i in soup.find_all('h3'):
    print(i.text.strip())
```

---

### 🔍 How Did We Know to Use `h3`?

**The Investigation Process:**

1. 🌐 **Visit the website** - Open the CarWale page
2. 🖱️ **Right-click on car name** - Select "Inspect" or "Inspect Element"
3. 📋 **Check the Elements tab** - Browser DevTools opens
4. 🔎 **Examine each car card** - Look at the HTML structure
5. ✅ **Found it!** - Car names are inside `<h3>` tags

---

### 🛠️ Code Breakdown

| Code | What It Does | Output |
|------|--------------|--------|
| `soup.find_all('h3')` | Gets **all** `<h3>` tags from page | List of all h3 elements |
| `soup.find_all('h3')[0]` | Gets **first** `<h3>` tag only | Single h3 element |
| `for i in soup.find_all('h3'):` | **Loops through** each h3 tag | Iterates over all h3s |
| `i.text.strip()` | Extracts **clean text** from tag | "Jeep Compass" |
---

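### 📝 What This Extracts

**Output Example:**
```
Jeep Compass
Hyundai Creta
Maruti Suzuki Victoris
Mahindra Scorpio N
Mahindra XUV700
...
```

Most lines are **car names** extracted from the `<h3>` tags, but the page also uses `<h3>` for section headings such as "Top SUVs in India", so the list is not purely car names. 🚗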

---

### 💡 The Inspection Process
```
Website → Inspect Element → Find Pattern → Use in Code
   🌐   →       🔍        →      <h3>    →  find_all('h3')
```

**Why `h3` specifically?**  
Because that's where the **car names** are stored in the HTML structure! 🎯

---

### ✅ Key Takeaway

Always **inspect the website first** to find the correct HTML tags before scraping! 🔍✨
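Inspection also shows where a tag is used for something else. In the notebook's output, `find_all('h3')` returns headings such as "Top SUVs in India" alongside the car names. The sketch below reproduces that on hand-written HTML and narrows the search to the car containers; the class name `card` is hypothetical and stands in for whatever class the real page uses.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; the class name "card" is hypothetical.
html = """
<div class="card"><h3> Hyundai Creta </h3></div>
<div class="card"><h3> Kia Seltos </h3></div>
<section><h3>Top SUVs in India</h3></section>
"""
soup = BeautifulSoup(html, 'html.parser')

# Page-wide search: picks up every <h3>, including the section heading.
all_h3 = [h.get_text(strip=True) for h in soup.find_all('h3')]
print(all_h3)

# Scoped search: only the <h3> inside each car container.
car_h3 = [card.find('h3').get_text(strip=True)
          for card in soup.find_all('div', class_='card')]
print(car_h3)
```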

---

In [19]:
soup.find_all('h3')
soup.find_all('h3')[0]
for i in soup.find_all('h3'):
  print(i.text.strip())

Jeep Compass
Hyundai Creta
Maruti Suzuki Victoris
Mahindra Scorpio N
Mahindra XUV700
Toyota Urban Cruiser Hyryder
Mahindra Thar Roxx
Mahindra BE 6
Kia Seltos
Tata Harrier
Toyota Innova Hycross
Toyota Innova Crysta
Mahindra Scorpio
Mahindra XEV 9e
Tata Safari
Tata Harrier EV
Top SUVs in India
Top Sedans in India
Top Hatchbacks in India
Top Compact SUVs in India
Top Luxury Cars in India


### 💰 Extracting Average Ex-Showroom Price
```python
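# Finding average ex-showroom price

# soup.find_all('span').text  ❌ This gives ERROR!
# Why? find_all() returns a list of tags, and a list has no .text attribute.
# Looping over every <span> would not help either: many elements use <span>, not just prices.

# Solution: Use the specific CLASS of the price span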

for i in soup.find_all('span', class_='o-j5 o-ji o-jJ o-js'):
    print(i.text.strip())
```

---

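### ❌ Why `soup.find_all('span').text` Fails?

**The Problem:**

- 📋 **It's a list, not a tag** - `find_all()` returns a list of tags, so calling `.text` on it raises `AttributeError`
- 🏷️ **Too many span tags** - The page uses `<span>` for many elements besides prices
- 🎯 **Not specific enough** - Can't distinguish price spans from others
- ⚠️ **Wrong data** - Looping over every span prints unrelated text, not just prices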

---

### ✅ Solution: Use `class_` Attribute
```python
soup.find_all('span', class_='o-j5 o-ji o-jJ o-js')
```

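**What this does:**

- 🎯 **Targets specific spans** - Only spans whose `class` attribute is exactly this string
- 💰 **Gets only prices** - Filters out unwanted span tags
- ✨ **Clean output** - Extracts just the ex-showroom prices

These short class names look auto-generated, so they may change when the site is updated. If the search comes back empty, inspect the page again.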

---

### 📝 Output Example
```
Rs. 17.73 Lakh
Rs. 10.73 Lakh
Rs. 10.50 Lakh
Rs. 13.20 Lakh
...
```

---

### 💡 Key Lesson

**Generic tag** (`span`) = ❌ Too broad  
**Tag + Class** (`span` with specific class) = ✅ Precise targeting! 🎯
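The sketch below illustrates both points on hand-written HTML: the `AttributeError` from calling `.text` on the result of `find_all()`, and how a multi-class `class_` string is matched. The class names `price`, `big`, and `badge` are hypothetical. It also shows `select()` with a CSS selector as one alternative that does not depend on the order of the classes.

```python
from bs4 import BeautifulSoup

# Hypothetical class names; the real page uses generated ones such as 'o-j5 o-ji o-jJ o-js'.
html = """
<span class="price big">Rs. 10.73 Lakh</span>
<span class="big price">Rs. 10.50 Lakh</span>
<span class="badge">New</span>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like ResultSet, which has no .text of its own.
try:
    soup.find_all('span').text
    raised = False
except AttributeError:
    raised = True
print(raised)

# A multi-class string matches the class attribute exactly, order included.
exact = [s.get_text() for s in soup.find_all('span', class_='price big')]

# A CSS selector matches spans carrying both classes, in any order.
either_order = [s.get_text() for s in soup.select('span.price.big')]
print(exact)
print(either_order)
```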

---

In [25]:
# finding average ex-showroom price

# soup.find_all('span').text  # raises AttributeError: find_all() returns a list of tags, which has no .text
# looping over every span would not help either, because many other elements also use the span tag

# therefore use the class of the average ex-showroom price span

for i in soup.find_all('span', class_ = 'o-j5 o-ji o-jJ o-js'):
  print(i.text.strip())

Rs. 17.73 Lakh
Rs. 10.73 Lakh
Rs. 10.50 Lakh
Rs. 13.20 Lakh
Rs. 13.66 Lakh
Rs. 10.95 Lakh
Rs. 12.25 Lakh
Rs. 18.90 Lakh
Rs. 10.79 Lakh
Rs. 14.00 Lakh
Rs. 18.86 Lakh
Rs. 18.66 Lakh
Rs. 12.98 Lakh
Rs. 21.90 Lakh
Rs. 14.66 Lakh
Rs. 21.49 Lakh


### 📦 Fetching Complete Car Cards
```python
cars = soup.find_all('div', class_='o-bC o-aE o-f')
len(cars)
```

---

### 🎯 Why Fetch Complete Containers?

**The Problem with Separate Extraction:**

- 🔀 **Data gets mixed up** - Can't tell which price belongs to which car
- 🧩 **No connection** - Car names and prices are separate lists
- ❌ **Matching nightmare** - Hard to pair data correctly

**The Container Solution:**

- 📦 **One card = One car** - All data stays together
- ✅ **Organized structure** - Price, mileage, features linked to specific car
- 🎯 **Easy extraction** - Extract all details from single container
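To make the mix-up concrete, here is a small sketch on hand-written HTML in which one card has no price. The class names `card` and `price` and the car names are placeholders, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Stand-in markup with hypothetical class names; the second card has no price.
html = """
<div class="card"><h3>Car A</h3><span class="price">Rs. 10.00 Lakh</span></div>
<div class="card"><h3>Car B</h3></div>
<div class="card"><h3>Car C</h3><span class="price">Rs. 12.00 Lakh</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Page-wide lists: 3 names but only 2 prices, so pairing them by position
# would give Car B the price that belongs to Car C.
names = [h.get_text() for h in soup.find_all('h3')]
prices = [s.get_text() for s in soup.find_all('span', class_='price')]
print(len(names), len(prices))

# Per-card extraction keeps each price next to its own car.
rows = []
for card in soup.find_all('div', class_='card'):
    price = card.find('span', class_='price')
    rows.append((card.find('h3').get_text(), price.get_text() if price else None))
print(rows)
```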

---

### 🔍 What This Code Does

| Code | Explanation |
|------|-------------|
| `soup.find_all('div', ...)` | Finds all `<div>` containers |
| `class_='o-bC o-aE o-f'` | Specific class of each car card |
| `cars` | List of all car card containers |
| `len(cars)` | Total number of car cards found |

---

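### 📊 Example Output
```python
len(cars)  # Output: 15
```

**Meaning:** Found **15 complete car cards** on the page! 🚗

That is one fewer than the 16 names and prices printed earlier: the first entry (Jeep Compass, Rs. 17.73 Lakh) is not inside a `div` with this class.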

---

### 💡 Next Step

Now you can loop through each `car` container and extract **all its data** together! 🎉

---

In [28]:
cars = soup.find_all('div', class_ = 'o-bC o-aE o-f')
len(cars)

15

### 🗂️ Extracting Data from Each Car Card
```python
carNames = []
exShowRoomPrices = []

for i in cars:
    # Extract car name
    name_tag = i.find('h3')
    if name_tag:
        carNames.append(name_tag.get_text(strip=True))
    else:
        carNames.append(None)

    # Extract price
    price_tag = i.find('span', class_='o-j5 o-ji o-jJ o-js')
    if price_tag:
        exShowRoomPrices.append(price_tag.get_text(strip=True))
    else:
        exShowRoomPrices.append(None)

print(exShowRoomPrices)
```

---

### 🔄 How This Works

**Step-by-Step Process:**

1. 📋 **Create empty lists** - To store car names and prices
2. 🔁 **Loop through each car card** - Process one card at a time
3. 🔍 **Find name inside card** - Extract `<h3>` tag
4. 💰 **Find price inside card** - Extract price span
5. ✅ **Append to lists** - Add data or `None` if missing
6. 📊 **Print results** - Display all extracted prices

---

### ⚡ `find()` vs `find_all()` - Critical Difference!

| Method | Returns | Use Case | Example |
|--------|---------|----------|---------|
| `find()` | **First match only** | Get ONE element per card | `i.find('h3')` → Single name |
| `find_all()` | **All matches (list)** | Get MULTIPLE elements | `i.find_all('span')` → All spans |

---

### 🤔 What If We Used `find_all()` Instead?
```python
# ❌ WRONG: Using find_all()
name_tag = i.find_all('h3')  # Returns a LIST [<h3>...</h3>]
carNames.append(name_tag.get_text())  # AttributeError! Lists don't have .get_text()

# ✅ CORRECT: Using find()
name_tag = i.find('h3')  # Returns SINGLE element <h3>...</h3>
carNames.append(name_tag.get_text())  # Works when the card contains an <h3>
```

**Problem with `find_all()`:**
- 📦 Returns a **list**, even if only one element exists
- ❌ Can't directly call `.get_text()` on a list
- 🔄 Would need extra loop: `for tag in name_tags: tag.get_text()`

---


### 💡 Key Takeaway

**Use `find()`** when you need **ONE specific element** per container!  
**Use `find_all()`** when you need **MULTIPLE elements** from the same container! 🎯
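The two methods also differ when nothing matches, which is why the extraction loop checks the result of `find()` before reading its text. A short sketch on a one-card stand-in follows; `text_or_none` is a hypothetical helper, not something from the notebook.

```python
from bs4 import BeautifulSoup

card = BeautifulSoup('<div><h3>Hyundai Creta</h3></div>', 'html.parser')

print(card.find('h3'))          # a single Tag
print(card.find_all('h3'))      # a list-like ResultSet holding one Tag
print(card.find('span'))        # None when nothing matches
print(card.find_all('span'))    # an empty ResultSet when nothing matches

def text_or_none(tag):
    """Hypothetical helper: stripped text of a Tag, or None if find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None

print(text_or_none(card.find('h3')), text_or_none(card.find('span')))
```

A helper like this could replace each `if`/`else` pair in the loop, for example `carNames.append(text_or_none(i.find('h3')))`.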

---

In [38]:
carNames = []
exShowRoomPrices = []

for i in cars:
    # Extract car name
    name_tag = i.find('h3')
    if name_tag:
        carNames.append(name_tag.get_text(strip=True))
    else:
        carNames.append(None)

    # Extract price
    price_tag = i.find('span', class_='o-j5 o-ji o-jJ o-js')
    if price_tag:
        exShowRoomPrices.append(price_tag.get_text(strip=True))
    else:
        exShowRoomPrices.append(None)

print(exShowRoomPrices)
print(len(carNames) == len(exShowRoomPrices))


['Rs. 10.73 Lakh', 'Rs. 10.50 Lakh', 'Rs. 13.20 Lakh', 'Rs. 13.66 Lakh', 'Rs. 10.95 Lakh', 'Rs. 12.25 Lakh', 'Rs. 18.90 Lakh', 'Rs. 10.79 Lakh', 'Rs. 14.00 Lakh', 'Rs. 18.86 Lakh', 'Rs. 18.66 Lakh', 'Rs. 12.98 Lakh', 'Rs. 21.90 Lakh', 'Rs. 14.66 Lakh', 'Rs. 21.49 Lakh']
True


### 📊 Creating a DataFrame
```python
df = pd.DataFrame({
    'name': carNames,
    'price': exShowRoomPrices
})
```

---

### 🎯 What This Does

**Combines extracted data into a structured table:**

- 📋 **DataFrame** - Pandas table structure for organized data
- 🏷️ **Columns** - 'name' and 'price' as column headers
- 📝 **Rows** - Each car as a separate row
- 🗂️ **Organized** - Easy to view, analyze, and export data

---

### 📊 DataFrame Structure

| Column | Data Source | Example Value |
|--------|-------------|---------------|
| **name** | `carNames` list | "Hyundai Creta" |
| **price** | `exShowRoomPrices` list | "Rs. 10.73 Lakh" |

---

### 📝 Output Preview
```
                     name           price
0           Hyundai Creta  Rs. 10.73 Lakh
1  Maruti Suzuki Victoris  Rs. 10.50 Lakh
2      Mahindra Scorpio N  Rs. 13.20 Lakh
3         Mahindra XUV700  Rs. 13.66 Lakh
...
```

---

### ✅ Data Ready!

Your scraped car data is now in a **structured format**. The prices are still text such as "Rs. 10.73 Lakh", so convert them to numbers before doing any calculations. 🚗📈
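As one possible next step, the prices could be turned into numbers. The sketch below uses only the standard library; the function name `price_in_lakh` is hypothetical, and it assumes every price is quoted in lakh (1 lakh = 100,000 rupees) in the "Rs. 10.73 Lakh" form seen in the output. Anything else, including a missing price, becomes `None`.

```python
import re

def price_in_lakh(text):
    """Parse strings like 'Rs. 10.73 Lakh' into a float number of lakh; None if no match."""
    if not isinstance(text, str):
        return None  # covers None from cards without a price
    match = re.search(r'(\d+(?:\.\d+)?)\s*Lakh', text)
    return float(match.group(1)) if match else None

prices = ['Rs. 10.73 Lakh', 'Rs. 10.50 Lakh', None]
parsed = [price_in_lakh(p) for p in prices]
print(parsed)  # [10.73, 10.5, None]
```

On the DataFrame this could be applied as `df['price_lakh'] = df['price'].map(price_in_lakh)`.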

---

In [44]:
df = pd.DataFrame({
    'name' : carNames,
    'price': exShowRoomPrices
})

df

| | name | price |
|---|------|-------|
| 0 | Hyundai Creta | Rs. 10.73 Lakh |
| 1 | Maruti Suzuki Victoris | Rs. 10.50 Lakh |
| 2 | Mahindra Scorpio N | Rs. 13.20 Lakh |
| 3 | Mahindra XUV700 | Rs. 13.66 Lakh |
| 4 | Toyota Urban Cruiser Hyryder | Rs. 10.95 Lakh |
| 5 | Mahindra Thar Roxx | Rs. 12.25 Lakh |
| 6 | Mahindra BE 6 | Rs. 18.90 Lakh |
| 7 | Kia Seltos | Rs. 10.79 Lakh |
| 8 | Tata Harrier | Rs. 14.00 Lakh |
| 9 | Toyota Innova Hycross | Rs. 18.86 Lakh |


In [45]:
df.shape

(15, 2)

In [50]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

carNames = []
exShowRoomPrices = []

# Loop over pages 1 to 7 of the listing
for page in range(1, 8):
    url = f'https://www.carwale.com/new/best-cars-under-30-lakh/{page}/'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    # One container per car on this page
    cars = soup.find_all('div', class_='o-bC o-aE o-f')

    for car in cars:
        # Extract car name
        name_tag = car.find('h3')
        if name_tag:
            carNames.append(name_tag.get_text(strip=True))
        else:
            carNames.append(None)

        # Extract price
        price_tag = car.find('span', class_='o-j5 o-ji o-jJ o-js')
        if price_tag:
            exShowRoomPrices.append(price_tag.get_text(strip=True))
        else:
            exShowRoomPrices.append(None)

df = pd.DataFrame({
    'name' : carNames,
    'price' : exShowRoomPrices
})

df


| | name | price |
|---|------|-------|
| 0 | Hyundai Creta | Rs. 10.73 Lakh |
| 1 | Maruti Suzuki Victoris | Rs. 10.50 Lakh |
| 2 | Mahindra Scorpio N | Rs. 13.20 Lakh |
| 3 | Mahindra XUV700 | Rs. 13.66 Lakh |
| 4 | Toyota Urban Cruiser Hyryder | Rs. 10.95 Lakh |
| ... | ... | ... |
| 97 | Renault Kwid | Rs. 4.30 Lakh |
| 98 | Maruti Suzuki Celerio | Rs. 4.70 Lakh |
| 99 | Maruti Suzuki S-Presso | Rs. 3.50 Lakh |
| 100 | Maruti Suzuki Eeco | Rs. 5.21 Lakh |
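
The multi-page loop above hard-codes 7 pages and sends its requests back to back. A possible variation, sketched here under assumptions, stops as soon as a page yields no rows and pauses between requests. `scrape_pages` and its `fetch_page` callback are hypothetical names, and the fake data stands in for real fetching and parsing so the example runs offline.

```python
import time

def scrape_pages(fetch_page, max_pages, delay=1.0):
    """Collect rows page by page; stop early when a page yields no rows.

    fetch_page(page_number) is a caller-supplied function returning a list of rows.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break                # ran past the last page
        rows.extend(page_rows)
        if page < max_pages:
            time.sleep(delay)    # pause between requests
    return rows

# Stand-in for real fetching and parsing: two pages of data, then nothing.
fake_site = {1: [('Car A', 'Rs. 10.00 Lakh')], 2: [('Car B', 'Rs. 11.00 Lakh')]}
result = scrape_pages(lambda n: fake_site.get(n, []), max_pages=7, delay=0)
print(len(result))  # 2
```

In real use, `fetch_page` would download one listing page and return the `(name, price)` pairs found in its car cards.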
