# Web Scraping Tutorial

This notebook is a step-by-step guide to scraping data from a website. Web scraping extracts information from web pages and converts it into a structured format, which is useful for data analysis, machine learning, and other data-driven tasks.

In this tutorial, we will walk through the process of scraping product information from a sample e-commerce site. By following these steps, you will learn how to:

1. Send HTTP requests to retrieve web pages.
2. Parse HTML content using BeautifulSoup.
3. Identify and extract relevant data elements from the parsed HTML.
4. Store the extracted data in a structured format using pandas.
5. Save the data to a CSV file.
6. Optionally, save the data to a database such as MongoDB.

The website we will be scraping is [ScrapeMe](https://scrapeme.live/shop/). This site is designed for practice purposes and contains a variety of products with details such as names and prices, which makes it an ideal candidate for learning web scraping techniques.

Before you begin, please visit the site to understand its structure. This will help you identify the elements you need to scrape.

Let's get started!

In [None]:
! pip install requests
! pip install beautifulsoup4
! pip install pymongo

Collecting pymongo
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0


## Import libraries here

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports for saving data to database
import pymongo
from pymongo import MongoClient
from pymongo.server_api import ServerApi
import os

## Step 1: Send a request to the website

In [None]:
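# The URL we want to scrape
url = "https://scrapeme.live/shop/"

# A timeout keeps the cell from hanging if the server does not respond
res = requests.get(url, timeout=10)

# Check that the request succeeded (res.ok is True for status codes below 400)
print(res.ok)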

# Print the first 1000 characters of the content
print(res.text[:1000])

True

<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">

<title>Products &#8211; ScrapeMe</title>
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Feed" href="https://scrapeme.live/feed/" />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Comments Feed" href="https://scrapeme.live/comments/feed/" />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Products Feed" href="https://scrapeme.live/shop/feed/" />
		<script type="text/javascript">
			window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/11\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/11\/svg\
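
The request above can be made a little more defensive. The following is a minimal sketch rather than part of the original notebook: `fetch_html` is a hypothetical helper, and the timeout value and `User-Agent` string are illustrative assumptions. It raises an exception on HTTP errors instead of printing a flag, and it accepts an optional session so the helper can be exercised without network access.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML of `url`, raising on HTTP errors (hypothetical helper)."""
    # `session` may be any object with a requests-style `get` method;
    # a fresh requests.Session is created when none is supplied.
    http = session or requests.Session()
    response = http.get(
        url,
        timeout=timeout,  # seconds to wait before giving up
        headers={"User-Agent": "scraping-tutorial/0.1"},  # illustrative value
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

A call such as `fetch_html("https://scrapeme.live/shop/")` would then return the same text as `res.text`, or fail loudly if the server answers with an error status.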

## Step 2: Parse the HTML content of the page

In [None]:
soup = BeautifulSoup(res.text, "html.parser")


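# Print the page title
print(soup.title)


# Print every <input> element on the page
# (the loop variable is named input_tag to avoid shadowing the built-in input())
for input_tag in soup.find_all("input"):
  print(input_tag)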

<title>Products – ScrapeMe</title>
<input class="search-field" id="woocommerce-product-search-field-0" name="s" placeholder="Search products…" type="search" value=""/>
<input name="post_type" type="hidden" value="product"/>
<input name="paged" type="hidden" value="1"/>
<input name="paged" type="hidden" value="1"/>
<input class="search-field" name="s" placeholder="Search …" type="search" value=""/>
<input class="search-submit" type="submit" value="Search"/>
<input class="search-field" id="woocommerce-product-search-field-1" name="s" placeholder="Search products…" type="search" value=""/>
<input name="post_type" type="hidden" value="product"/>
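
To make the lookup methods concrete, here is a small sketch that runs on a hand-written HTML string instead of the live page. The markup is invented for illustration; only `find`, `find_all`, `select`, and `get_text` are BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a real response body (illustrative only).
html = """
<html><head><title>Demo shop</title></head>
<body>
  <form>
    <input name="s" type="search"/>
    <input name="post_type" type="hidden" value="product"/>
  </form>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()            # text only, without the <title> tag
first_input = soup.find("input")              # first matching tag, or None
all_inputs = soup.find_all("input")           # every matching tag
hidden = soup.select('input[type="hidden"]')  # CSS selector syntax

print(title_text)
print(first_input["name"])
print([tag.get("value") for tag in all_inputs])
print(len(hidden))
```

`tag.get("value")` returns `None` when the attribute is missing, whereas `tag["value"]` would raise a `KeyError`, so `get` is the safer choice when attributes are optional.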


## Step 3: Inspect the website and identify the elements to scrape
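Use your browser's developer tools to inspect the page and find the elements that hold the data you want, such as product names and prices. On this site, each product is a list item inside a `<ul>` element with the class `products`.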

In [None]:
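# The product grid is a <ul> element with the class "products"
products = soup.find(name="ul", attrs={
    "class": "products"
})

# Iterating over a tag yields its direct children: the <li> product items
# and the whitespace strings between them
for product in products:
  print(product)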




<li class="post-759 product type-product status-publish has-post-thumbnail product_cat-pokemon product_cat-seed product_tag-bulbasaur product_tag-overgrow product_tag-seed first instock sold-individually taxable shipping-taxable purchasable product-type-simple">
<a class="woocommerce-LoopProduct-link woocommerce-loop-product__link" href="https://scrapeme.live/shop/Bulbasaur/"><img alt="" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail wp-post-image" height="324" sizes="(max-width: 324px) 100vw, 324px" src="https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png" srcset="https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png 350w, https://scrapeme.live/wp-content/uploads/2018/08/001-150x150.png 150w, https://scrapeme.live/wp-content/uploads/2018/08/001-300x300.png 300w, https://scrapeme.live/wp-content/uploads/2018/08/001-100x100.png 100w, https://scrapeme.live/wp-content/uploads/2018/08/001-250x250.png 250w, https://scrapeme.live/wp-content/uploa
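
The blank lines at the top of this output come from the whitespace between the `<li>` elements: iterating over a tag yields text nodes as well as tags. The sketch below shows the difference on a miniature, hand-written product list (the markup is a simplified assumption, not the site's full HTML).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified stand-in for the product grid (illustrative markup).
html = """<ul class="products">
<li class="product"><h2>Bulbasaur</h2></li>
<li class="product"><h2>Ivysaur</h2></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
products = soup.find("ul", class_="products")

children = list(products)  # tags and the newline strings between them
tags_only = [child for child in children if isinstance(child, Tag)]
items = products.find_all("li")  # only the <li> tags

print(len(children), len(tags_only), len(items))
print([item.h2.get_text() for item in items])
```

Using `find_all("li")` or filtering on `Tag` avoids having to special-case the whitespace strings later.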

## Step 4: Extract the desired data

In [None]:
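# Print the name and price of each product.
# find_all("li") returns only the product items and skips the whitespace
# strings between them, so no sentinel check is needed.
for product in products.find_all("li"):
  print("Product Info:")
  print("Name:", product.find("h2").text)
  print("Price:", product.find("span").text)
  print("--------------------------")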

Product Info:
Name: Bulbasaur
Price: £63.00
--------------------------
Product Info:
Name: Ivysaur
Price: £87.00
--------------------------
Product Info:
Name: Venusaur
Price: £105.00
--------------------------
Product Info:
Name: Charmander
Price: £48.00
--------------------------
Product Info:
Name: Charmeleon
Price: £165.00
--------------------------
Product Info:
Name: Charizard
Price: £156.00
--------------------------
Product Info:
Name: Squirtle
Price: £130.00
--------------------------
Product Info:
Name: Wartortle
Price: £123.00
--------------------------
Product Info:
Name: Blastoise
Price: £76.00
--------------------------
Product Info:
Name: Caterpie
Price: £73.00
--------------------------
Product Info:
Name: Metapod
Price: £148.00
--------------------------
Product Info:
Name: Butterfree
Price: £162.00
--------------------------
Product Info:
Name: Weedle
Price: £25.00
--------------------------
Product Info:
Name: Kakuna
Price: £148.00
--------------------------
Product 
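
The prices are extracted as strings such as `£63.00`. If you want to sort or aggregate them, they need to become numbers. The following sketch uses only the standard library; `parse_price` is a hypothetical helper, and it assumes the amount uses a dot as the decimal separator and an optional comma as the thousands separator.

```python
import re
from decimal import Decimal

# Matches the first run of digits with an optional decimal part.
_AMOUNT_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Split a price string into (currency_symbol, Decimal amount)."""
    cleaned = text.strip().replace(",", "")  # drop thousands separators
    match = _AMOUNT_RE.search(cleaned)
    if match is None:
        raise ValueError(f"no numeric amount in {text!r}")
    symbol = cleaned[: match.start()].strip()  # whatever precedes the digits
    return symbol, Decimal(match.group(0))


symbol, amount = parse_price("£63.00")
print(symbol, amount)
```

`Decimal` is used instead of `float` so that monetary values keep their exact two-decimal representation.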

## Step 5: Create a DataFrame to store the extracted data

In [None]:
products_dict = {
    "product_name": [],
    "product_price": [],
}

# find_all("li") returns only the product items and skips the whitespace
# strings between them
for product in products.find_all("li"):
  name = product.find("h2").text
  price = product.find("span").text
  products_dict["product_name"].append(name)
  products_dict["product_price"].append(price)

products_df = pd.DataFrame(data=products_dict)
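
An alternative layout, sketched here as an assumption and not as the notebook's method, is to collect one dictionary per product and add a numeric price column. The two rows reuse values from the output above, and the column name `price_gbp` is hypothetical.

```python
import pandas as pd

# One dictionary per product; values taken from the scraped output above.
records = [
    {"product_name": "Bulbasaur", "product_price": "£63.00"},
    {"product_name": "Ivysaur", "product_price": "£87.00"},
]
df = pd.DataFrame.from_records(records)

# Strip the currency symbol and thousands separators, then convert to numbers.
df["price_gbp"] = pd.to_numeric(
    df["product_price"]
    .str.replace("£", "", regex=False)
    .str.replace(",", "", regex=False)
)

print(df.dtypes)
print(df["price_gbp"].mean())
```

With a numeric column, operations such as sorting by price or computing an average work directly on the DataFrame.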

## Step 6: Save the data to a CSV file

In [None]:
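# index=False leaves the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column
products_df.to_csv("products.csv", index=False)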
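
By default `to_csv` writes the row index as an extra column with an empty header, which `read_csv` then names `Unnamed: 0`. The sketch below demonstrates the effect of the `index` parameter using in-memory buffers in place of files; the two-row DataFrame is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "product_name": ["Bulbasaur", "Ivysaur"],
    "product_price": ["£63.00", "£87.00"],
})

with_index = io.StringIO()
df.to_csv(with_index)  # default: index=True

without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Rewind both buffers before reading them back.
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```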

## Step 7: Print the DataFrame to verify the extracted data

In [None]:

products_df.head()


|   | product_name | product_price |
|---|--------------|---------------|
| 0 | Bulbasaur | £63.00 |
| 1 | Ivysaur | £87.00 |
| 2 | Venusaur | £105.00 |
| 3 | Charmander | £48.00 |
| 4 | Charmeleon | £165.00 |
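
Looking at the first rows is a quick visual check. A programmatic check can complement it; the sketch below is one possible approach, where `summarize` is a hypothetical helper and the sample rows (including the deliberately missing price and duplicated name) are invented for illustration.

```python
import pandas as pd


def summarize(df):
    """Return basic quality counts for a scraped products DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_names": int(df["product_name"].duplicated().sum()),
    }


# Invented sample with one missing price and one repeated name.
sample = pd.DataFrame({
    "product_name": ["Bulbasaur", "Ivysaur", "Ivysaur"],
    "product_price": ["£63.00", None, "£87.00"],
})
report = summarize(sample)
print(report)
```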


## Step 8: Save the data to a database (MongoDB)

### Connecting to MongoDB

In [None]:

# Placeholder: replace with your own MongoDB connection string
uri = "MONGO_CONNECTION_STRING"

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)




Pinged your deployment. You successfully connected to MongoDB!
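
Connection strings usually contain credentials, so it is safer to keep them out of the notebook. One common approach, sketched here under the assumption that the string is stored in an environment variable named `MONGO_CONNECTION_STRING` (the variable name and the `get_mongo_uri` helper are hypothetical), is to read it with `os.environ`:

```python
import os


def get_mongo_uri(env_var="MONGO_CONNECTION_STRING"):
    """Read the MongoDB connection string from an environment variable."""
    uri = os.environ.get(env_var)
    if not uri:
        # Fail early with a clear message instead of a confusing connection error.
        raise RuntimeError(f"environment variable {env_var} is not set")
    return uri
```

The returned value can then be passed to `MongoClient` in place of the hard-coded string.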


In [None]:
# Reuse the client created in the previous cell

# Select the database (MongoDB creates it on the first write)
db = client['T5Bootcamp']

# Select the collection (also created on the first write)
scrapeme_coll = db['web_scraping_']

### Store data in MongoDB

In [None]:
# Convert the DataFrame to a list of dictionaries (one per row) so it can be stored in the database
products_in_dict = products_df.to_dict("records")
scrapeme_coll.insert_many(products_in_dict)

InsertManyResult([ObjectId('66b096bab20a795fcbac00a8'), ObjectId('66b096bab20a795fcbac00a9'), ObjectId('66b096bab20a795fcbac00aa'), ObjectId('66b096bab20a795fcbac00ab'), ObjectId('66b096bab20a795fcbac00ac'), ObjectId('66b096bab20a795fcbac00ad'), ObjectId('66b096bab20a795fcbac00ae'), ObjectId('66b096bab20a795fcbac00af'), ObjectId('66b096bab20a795fcbac00b0'), ObjectId('66b096bab20a795fcbac00b1'), ObjectId('66b096bab20a795fcbac00b2'), ObjectId('66b096bab20a795fcbac00b3'), ObjectId('66b096bab20a795fcbac00b4'), ObjectId('66b096bab20a795fcbac00b5'), ObjectId('66b096bab20a795fcbac00b6'), ObjectId('66b096bab20a795fcbac00b7')], acknowledged=True)
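
Running `insert_many` again inserts the same products a second time. If the cell may be re-run, an upsert keyed on the product name is one way to avoid duplicates. The sketch below is an assumption about how that could look: `upsert_products` is a hypothetical helper that relies only on the collection's `update_one(filter, update, upsert=True)` method, and it treats `product_name` as a unique key.

```python
def upsert_products(collection, records, key="product_name"):
    """Insert or update each record, matching on `key` (hypothetical helper)."""
    count = 0
    for record in records:
        # insert_many adds an "_id" field to the dictionaries it is given;
        # leave it out so the update never tries to change a document's _id.
        fields = {k: v for k, v in record.items() if k != "_id"}
        collection.update_one(
            {key: record[key]},   # match an existing document by name
            {"$set": fields},     # overwrite its fields with the new values
            upsert=True,          # insert a new document when nothing matches
        )
        count += 1
    return count
```

For example, `upsert_products(scrapeme_coll, products_df.to_dict("records"))` would leave one document per product no matter how many times it is called.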