# 🌐 Web Scraping: A Simple Guide for Beginners 🚀


<div style="box-shadow: 5px 5px 10px rgba(0, 0, 0, 0.1); border: 2px solid #3498DB; border-radius: 10px; padding: 20px; text-align: center;">

<h3 style="color:#D35400; font-weight:bold;">ABOUT THE AUTHOR</h3>

I am **Zeeshan Younas**, a passionate data scientist dedicated to mastering machine learning techniques and continually expanding my knowledge base. I believe in #KeepLearning and #KeepSupporting to keep growing and helping others in the field! 🌟

<img src="https://media.licdn.com/dms/image/D4D03AQG5iDKRRJFsCQ/profile-displayphoto-shrink_200_200/0/1714460771099?e=1720051200&v=beta&t=v8MfJW0-fdwbuOjHePjBMSdor0Nq5PhBhrpAtXhljlk" alt="Profile Picture" style="width: 100px; height: 100px; border-radius: 50%; margin: 10px 0;">

You can find more about me on my [GitHub](https://github.com/Zeeshan5932/project) and [LinkedIn](https://www.linkedin.com/in/zeeshan-younas-919a09253/).

Feel free to connect and reach out for any collaboration or queries!

</div>


### **What is Web Scraping?**

**Web scraping** is the automated extraction of data from websites. A scraper fetches web pages over HTTP and parses their HTML to pull out specific pieces of information.

### **Why Do We Use Web Scraping?**

1. <span style="color: #3366ff">**Data Collection**</span>: To gather data from various websites for research, analysis, or integration into databases.
2. <span style="color: #3366ff">**Market Research**</span>: To monitor competitor prices, product details, and market trends.
3. <span style="color: #3366ff">**Content Aggregation**</span>: To compile information from multiple sources into one place, such as news, blogs, or reviews.
4. <span style="color: #3366ff">**Lead Generation**</span>: To collect contact information and details for potential customers.
5. <span style="color: #3366ff">**Job Listings**</span>: To compile job postings and trends from various employment websites.
6. <span style="color: #3366ff">**E-commerce**</span>: To track product availability, prices, and reviews across different online stores.




### **Pros of Web Scraping:**

1. **Data Collection**: Efficiently gather large volumes of data from multiple sources.
   
2. **Market Research**: Monitor competitor activities, prices, and trends.
   
3. **Content Aggregation**: Compile information for analysis or integration into applications.
   
4. **Automation**: Perform repetitive tasks quickly and accurately.
   
5. **Real-Time Data**: Access up-to-date information for timely decision-making.

### **Cons of Web Scraping:**

1. **Legal Issues**: Scraping can violate a website's terms of service or infringe copyright.
   
2. **Ethical Concerns**: Data privacy and usage concerns, especially with personal information.
   
3. **Technical Challenges**: Websites may use anti-scraping techniques like CAPTCHA or IP blocking.
   
4. **Maintenance**: Need to regularly update scrapers to adapt to website changes.
   
5. **Data Quality**: Scraped data may require cleaning and validation.

---------------------------------------------------------

### **Legal Methods:**

1. **Publicly Available Data**: Scraping data that is publicly accessible and not explicitly prohibited by the website's terms of service.

2. **APIs**: Using APIs (Application Programming Interfaces) provided by websites for data access, where available and permitted.

3. **Consent**: Obtaining explicit consent from website owners to scrape their data, often through formal agreements or licenses.

### **Illegal Methods:**

1. **Terms of Service Violation**: Scraping data from websites that explicitly prohibit scraping in their terms of service.

2. **Copyright Violation**: Scraping copyrighted content without permission or fair use justification.

3. **Unauthorized Access**: Bypassing security measures or accessing non-public areas of a website.

4. **Personal Data**: Scraping personally identifiable information (PII) without consent, which may violate data protection laws.

5. **Competitive Harm**: Scraping data for competitive advantage in a manner that harms the original website or its users.
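
One practical check that complements reading a site's terms of service is its `robots.txt` file, where many sites state which paths automated clients may visit. The sketch below uses Python's standard `urllib.robotparser` module. The rules, the `example.com` URLs, and the `my-scraper` user agent are all made up for illustration and are parsed from a string so the snippet runs offline; for a real site you would point the parser at the actual file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/products/tablets")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/accounts")
print(allowed, blocked)
```

Here the first URL is permitted and the second is not, because only paths under `/private/` are disallowed in the example rules.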



# Import Libraries

`requests`: Sends HTTP requests. It is used here to fetch the HTML content of web pages.

`BeautifulSoup`: Parses HTML and XML documents. It turns raw HTML into a searchable structure that you can work with programmatically.

`pandas`: Provides data structures and tools for data analysis and manipulation. It is used here to organize the scraped data into a table.

In [26]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [11]:
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets"
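r = requests.get(url, timeout=10)  # stop waiting after 10 seconds instead of hanging
r.raise_for_status()               # raise an exception on a 4xx or 5xx response
# print(r)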

The line `soup = BeautifulSoup(r.text, "lxml")` passes the raw HTML of the page (`r.text`) to the `BeautifulSoup` constructor, together with the name of the parser to use (`lxml`). The resulting object, `soup`, represents the page as a tree that you can search and extract data from. Note that `lxml` is a separate package and must be installed alongside `beautifulsoup4`.

In [12]:
soup = BeautifulSoup(r.text, "lxml")
# print(soup)
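
To see what a parsed `soup` object offers without fetching anything, here is a minimal sketch on a tiny hand-written HTML string. The snippet and the name `demo_soup` are assumptions for illustration, and it uses Python's built-in `html.parser` so it runs even without `lxml` installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet standing in for r.text (illustrative only).
html = """
<html><body>
  <h1>Tablets</h1>
  <a class="title" href="/product/10" title="Lenovo IdeaTab">Lenovo IdeaTab</a>
</body></html>
"""

# "html.parser" ships with Python; the notebook itself uses the "lxml" parser.
demo_soup = BeautifulSoup(html, "html.parser")

heading = demo_soup.find("h1").text          # text inside the first <h1>
link = demo_soup.find("a", class_="title")   # first <a> whose class is "title"
print(heading, link.text, link["href"])
```

`find` returns the first matching element, while `find_all` (used in the following cells) returns a list of every match.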

In [15]:
boxes = soup.find_all("div" , class_="col-md-4 col-xl-4 col-lg-4")
print(len(boxes))

21
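
One thing to keep in mind: when `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the same classes in a different order do not match. A CSS selector via `select()` is one possible alternative that ignores order. The sketch below shows the difference on two hand-written `div` elements (illustrative HTML, not taken from the site).

```python
from bs4 import BeautifulSoup

# Two made-up cards with the same classes listed in a different order.
html = """
<div class="col-md-4 col-xl-4 col-lg-4">card A</div>
<div class="col-lg-4 col-md-4 col-xl-4">card B</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only card A qualifies.
exact = demo_soup.find_all("div", class_="col-md-4 col-xl-4 col-lg-4")

# CSS selector: any <div> carrying all three classes, in any order.
by_selector = demo_soup.select("div.col-md-4.col-xl-4.col-lg-4")

print(len(exact), len(by_selector))
```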


In [18]:
names = soup.find_all("a" , class_ = "title")
names

[<a class="title" href="/test-sites/e-commerce/allinone/product/10" title="Lenovo IdeaTab">Lenovo IdeaTab</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/15" title="IdeaTab A3500L">IdeaTab A3500L</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/11" title="Acer Iconia">Acer Iconia</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/18" title="Galaxy Tab 3">Galaxy Tab 3</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/27" title="Iconia B1-730HD">Iconia B1-730H...</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/23" title="Memo Pad HD 7">Memo Pad HD 7</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/12" title="Asus MeMO Pad">Asus MeMO Pad</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/13" title="Amazon Kindle">Amazon Kindle</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/22" title="Galaxy Tab 3">Galaxy Tab 3</a>,
 <a class="title"

In [19]:
# Loop over the links and print each product name
for name in names:
    print(name.text)

Lenovo IdeaTab
IdeaTab A3500L
Acer Iconia
Galaxy Tab 3
Iconia B1-730H...
Memo Pad HD 7
Asus MeMO Pad
Amazon Kindle
Galaxy Tab 3
IdeaTab A8-50
MeMO Pad 7
IdeaTab A3500-...
IdeaTab S5000
Galaxy Tab 4
Galaxy Tab
MeMo PAD FHD 1...
Galaxy Note
Galaxy Note
iPad Mini Reti...
Galaxy Note 10...
Apple iPad Air
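
Some of the printed names end in `...` because the page shortens the visible link text. The `find_all` output earlier shows that each link also carries a `title` attribute with the full name, so reading that attribute is one way to avoid the truncation. A small sketch, using one anchor copied from that output:

```python
from bs4 import BeautifulSoup

# One anchor copied from the find_all output, where the visible text is shortened.
html = (
    '<a class="title" href="/test-sites/e-commerce/allinone/product/27" '
    'title="Iconia B1-730HD">Iconia B1-730H...</a>'
)
link = BeautifulSoup(html, "html.parser").find("a", class_="title")

shown_name = link.text.strip()
# .get() returns the fallback instead of raising KeyError if the attribute is missing.
full_name = link.get("title", shown_name)
print(shown_name, "->", full_name)
```

In the notebook, the same idea would be `name.get("title", name.text)` inside the loop over `names`.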


In [21]:
prices = soup.find_all("h4" , class_ = "price float-end card-title pull-right")

for price in prices:
    print(price.text)

$69.99
$88.99
$96.99
$97.99
$99.99
$101.99
$102.99
$103.99
$107.99
$121.99
$130.99
$148.99
$172.99
$233.99
$251.99
$320.99
$399.99
$489.99
$537.99
$587.99
$603.99
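
The prices come back as text such as `$69.99`, which cannot be sorted or averaged numerically. A minimal sketch of converting them to numbers follows. The helper name `parse_price` is hypothetical, and the handling of thousands separators is an assumption, since none of the prices above contains a comma.

```python
def parse_price(text: str) -> float:
    """Convert a price string such as "$69.99" to a float."""
    # Drop surrounding whitespace, the leading dollar sign, and any thousands commas.
    return float(text.strip().lstrip("$").replace(",", ""))


raw_prices = ["$69.99", "$233.99", "$603.99"]  # values taken from the output above
numeric_prices = [parse_price(p) for p in raw_prices]
print(numeric_prices)
```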


In [28]:
review = soup.find_all("p" , class_ = "review-count float-end")

for rev in review:  # use a new name so the response object `r` is not overwritten
    print(rev.text)

7 reviews
7 reviews
7 reviews
2 reviews
1 reviews
10 reviews
14 reviews
3 reviews
14 reviews
13 reviews
11 reviews
9 reviews
8 reviews
1 reviews
14 reviews
7 reviews
12 reviews
9 reviews
8 reviews
6 reviews
7 reviews
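
The review counts are also text (`7 reviews`). Once the data is in pandas, the number can be pulled out with a regular expression. This is a sketch on a few sample strings from the output above; the variable names are illustrative.

```python
import pandas as pd

# Sample strings copied from the output above.
review_texts = pd.Series(["7 reviews", "1 reviews", "14 reviews"])

# Capture the first run of digits in each string, then convert to integers.
review_counts = review_texts.str.extract(r"(\d+)", expand=False).astype(int)
print(review_counts.tolist())
```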


In [23]:
description = soup.find_all("p" , class_ = "description card-text")

for d in description:
    print(d.text)

7" screen, Android
Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2
7" screen, Android, 16GB
7", 8GB, Wi-Fi, Android 4.2, White
Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4
IPS, Dual-Core 1.2GHz, 8GB, Android 4.3
7" screen, Android, 8GB
6" screen, wifi
7", 8GB, Wi-Fi, Android 4.2, Yellow
Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2
White, 7", Atom 1.2GHz, 8GB, Android 4.4
Blue, 7" IPS, Quad-Core 1.3GHz, 8GB, 3G, Android 4.2
Silver, 7" IPS, Quad-Core 1.2Ghz, 16GB, 3G, Android 4.2
LTE (SM-T235), Quad-Core 1.2GHz, 8GB, Black
16GB, White
White, 10.1" IPS, 1.6GHz, 2GB, 16GB, Android 4.2
10.1", 3G, Android 4.0, Garnet Red
12.2", 32GB, WiFi, Android 4.4, White
Wi-Fi + Cellular, 32GB, Silver
10.1", 32GB, Black
Wi-Fi, 64GB, Silver


In [30]:
# Build one row per product from the four parallel lists
row = []
for name, price, rev, desc in zip(names, prices, review, description):
    row.append([name.text, price.text, rev.text, desc.text])

df = pd.DataFrame(row, columns=["Name", "Price", "Review", "Description"])
df


|  | Name | Price | Review | Description |
|---|---|---|---|---|
| 0 | Lenovo IdeaTab | $69.99 | 7 reviews | 7" screen, Android |
| 1 | IdeaTab A3500L | $88.99 | 7 reviews | Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2 |
| 2 | Acer Iconia | $96.99 | 7 reviews | 7" screen, Android, 16GB |
| 3 | Galaxy Tab 3 | $97.99 | 2 reviews | 7", 8GB, Wi-Fi, Android 4.2, White |
| 4 | Iconia B1-730H... | $99.99 | 1 reviews | Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4 |
| 5 | Memo Pad HD 7 | $101.99 | 10 reviews | IPS, Dual-Core 1.2GHz, 8GB, Android 4.3 |
| 6 | Asus MeMO Pad | $102.99 | 14 reviews | 7" screen, Android, 8GB |
| 7 | Amazon Kindle | $103.99 | 3 reviews | 6" screen, wifi |
| 8 | Galaxy Tab 3 | $107.99 | 14 reviews | 7", 8GB, Wi-Fi, Android 4.2, Yellow |
| 9 | IdeaTab A8-50 | $121.99 | 13 reviews | Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2 |
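
Pairing four separate lists by position works here because every product has a name, price, review count, and description. If one product lacked, say, a review count, the lists would fall out of step and values would land in the wrong rows. A possible alternative is to loop over each product card and search inside it. The sketch below uses simplified, hand-written HTML; the `card`, `price`, and `review-count` class names and the helper `text_or_none` are assumptions for illustration, not the site's real markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Simplified, made-up cards; the second one deliberately has no review count.
html = """
<div class="card"><a class="title" title="Lenovo IdeaTab">Lenovo IdeaTab</a>
  <h4 class="price">$69.99</h4><p class="review-count">7 reviews</p></div>
<div class="card"><a class="title" title="Acer Iconia">Acer Iconia</a>
  <h4 class="price">$96.99</h4></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def text_or_none(card, tag, css_class):
    """Return the stripped text of the first matching element in a card, or None."""
    element = card.find(tag, class_=css_class)
    return element.text.strip() if element else None


# Search inside each card so the fields of one product always stay together.
records = [
    {
        "Name": text_or_none(card, "a", "title"),
        "Price": text_or_none(card, "h4", "price"),
        "Review": text_or_none(card, "p", "review-count"),
    }
    for card in demo_soup.find_all("div", class_="card")
]
demo_df = pd.DataFrame(records)
print(demo_df)
```

The missing review count shows up as an empty value in its own row instead of shifting the rows that follow.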


In [31]:
# Save the DataFrame as a CSV file (the row index is written as the first column)
df.to_csv("tablets data.csv")
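
By default `to_csv` also writes the row index, which appears as a column named `Unnamed: 0` when the file is read back with `pd.read_csv`. Passing `index=False` leaves it out. The sketch below compares both on a one-row stand-in DataFrame, writing to in-memory buffers instead of real files so that nothing is created on disk.

```python
import io
import pandas as pd

# A one-row stand-in for the scraped DataFrame.
demo_df = pd.DataFrame({"Name": ["Lenovo IdeaTab"], "Price": ["$69.99"]})

with_index = io.StringIO()
demo_df.to_csv(with_index)                  # default: index written as an unnamed column

without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)  # index omitted

# Rewind the buffers and read them back like files.
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```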