# Web Scraping

Web scraping is the automated extraction of data from websites. A scraper fetches web pages and parses their content to retrieve specific information, such as text, images, or links. It is commonly done in Python: the `requests` library fetches pages, BeautifulSoup parses their HTML, and Scrapy provides a full crawling framework. Web scraping is useful for data analysis, price comparison, and research, but a scraper should respect the website's terms of service, avoid overloading the server, and comply with any legal restrictions.
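
One way to act on the advice about respecting a site and not overloading its server is to consult the site's `robots.txt` before fetching anything. The sketch below uses only the standard library and parses a made-up `robots.txt` offline; the file content, the `my-scraper` user agent, and the helper names are assumptions for illustration, not rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots_parser(robots_txt):
    """Build a parser from robots.txt text instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_robots_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/report.html",
]
to_fetch = allowed_urls(parser, "my-scraper", candidates)

# Use the site's requested delay if it declares one, else fall back to 1 second.
delay = parser.crawl_delay("my-scraper") or 1

for url in to_fetch:
    print("would fetch:", url, "then wait", delay, "seconds")
```

In a real scraper, a `time.sleep(delay)` between requests would apply the delay; it is left out here so the sketch runs instantly.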

In [12]:
import requests

response = requests.get("https://api.github.com/events")
# print(response)
# print(response.status_code)
if response.status_code == 200:
    print("successfully fetched the page")
    # print(response.headers)
    # print(response.headers['Content-Type'])
    # to print the content as text:
    # print(response.text)

    # if the content type is JSON, parse and print the content
    if 'application/json' in response.headers.get('Content-Type', ''):
        data = response.json()
        print(data)
else:
    # reached only when the status code is not 200
    print("some error happened while fetching the page")

successfully fetched the page


In [13]:
# Send a POST request with form data using the `requests` library.
# httpbin.org/post echoes the request back, so the form fields appear under "form" in the response.
# A status code of 200 means the request succeeded; httpbin does not actually validate the credentials.
import requests

data = {'username': 'user', 'password': 'pass'}
url = 'https://httpbin.org/post'
response = requests.post(url, data=data)
if response.status_code == 200:
    print("login success")
    print(response.text)
else:
    print("login failed", response.status_code)

login success
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "pass", 
    "username": "user"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "27", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.3", 
    "X-Amzn-Trace-Id": "Root=1-677f52c4-7e171bfc5da05ebe14d735f7"
  }, 
  "json": null, 
  "origin": "61.245.170.195", 
  "url": "https://httpbin.org/post"
}
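
The echoed response above shows the payload under `"form"` with a form-encoded `Content-Type`. To see where that comes from, the sketch below builds the same request without sending it, using `requests.Request(...).prepare()`, and compares it with a JSON body. The variable names are illustrative, and nothing here contacts httpbin.

```python
import json

import requests

payload = {"username": "user", "password": "pass"}
url = "https://httpbin.org/post"

# data= produces a form-encoded body, as in the cell above.
form_request = requests.Request("POST", url, data=payload).prepare()

# json= serializes the same dictionary as a JSON document instead.
json_request = requests.Request("POST", url, json=payload).prepare()

print(form_request.headers["Content-Type"])
print(form_request.body)
print(json_request.headers["Content-Type"])
print(json.loads(json_request.body))
```

Which form to use depends on what the server expects: HTML login forms usually take form-encoded data, while many JSON APIs expect the `json=` variant.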



In [9]:
# sending a GET request with query parameters
import requests 
url = 'https://example.com'
params = {'q': 'python', 'Category': 'books'}
response = requests.get(url, params=params)
if response.status_code == 200:
    print(response.url)
    print(response.text)
else:
    print('Request failed:', response.status_code)    

https://example.com/?q=python&Category=books
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
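
The first line of the output above is the final URL with the query string appended. As a rough illustration of how a `params` dictionary becomes that query string, the sketch below prepares the request without sending it; the second request, with a list value under a made-up `tag` key, is an added assumption to show how repeated keys are encoded.

```python
import requests

url = "https://example.com"
params = {"q": "python", "Category": "books"}

# Preparing the request builds the final URL without any network traffic.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# A list value is expanded into one key=value pair per item.
multi = requests.Request("GET", url, params={"tag": ["a", "b"]}).prepare()
print(multi.url)
```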


In [None]:

# sending a GET request with basic authentication

import requests

url = 'https://httpbin.org/get'

user_name = 'user'
password = 'pass'

response = requests.get(url, auth=(user_name, password))

if response.status_code == 200:
    print('Request successful')
    print(response.text)
else:
    print('Request failed:', response.status_code)    

Request successful
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Authorization": "Basic dXNlcjpwYXNz", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.3", 
    "X-Amzn-Trace-Id": "Root=1-677f5ba8-3caa7545342b14cf08f93694"
  }, 
  "origin": "61.245.170.195", 
  "url": "https://httpbin.org/get"
}
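
The `Authorization` header echoed above is simply the user name and password joined by a colon and Base64-encoded. The helper below is a hypothetical reconstruction for illustration (it is not the `requests` implementation, and it assumes UTF-8 credentials); it also shows why basic authentication should only be sent over HTTPS, since the encoding is trivially reversible.

```python
import base64


def basic_auth_header(user_name, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{user_name}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("user", "pass")
print(header)  # Basic dXNlcjpwYXNz

# Base64 is an encoding, not encryption: anyone can decode it.
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)  # user:pass
```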



In [22]:
# parsing an HTML document with BeautifulSoup
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)  # the <title> tag
print(soup.p)      # the first <p> tag
print(soup.a)      # the first <a> tag
p_tag = soup.find('p')  # equivalent to soup.p
print(p_tag)

<title>The Dormouse's story</title>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<p class="title"><b>The Dormouse's story</b></p>
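
Attribute access such as `soup.a` and `soup.find('p')` only return the first match. A scraper usually wants every match, which `find_all` provides. The sketch below reuses the same sample document to collect all three links; the `links` and `story_paragraphs` names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]
for text, href in links:
    print(text, href)

# The same call works for any tag, here the paragraphs with class "story".
story_paragraphs = soup.find_all("p", class_="story")
print(len(story_paragraphs))
```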


In [51]:
# web scraping exercise
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)

if response.status_code == 200:
    if response.text:
        soup = BeautifulSoup(response.text, 'html.parser')
        if soup.h1:
            print(soup.h1.text)
        else:
            print('Title not found')
    else:
        print('No content found')        
else:
    print('Request failed:', response.status_code)


Example Domain
