# **WEB SCRAPING**

<p><h3>Web scraping is a technique for extracting data from websites. This notebook uses popular scraping libraries such as <code>requests</code>, <code>BeautifulSoup</code>, and <code>selenium</code>.</h3></p>

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

import datetime

import smtplib #This library is used to send an email

<p><h3>The scraping is performed on the Address Guru website; its address is stored in <code>baseurl</code>.</h3></p>

In [2]:
baseurl = 'https://www.addressguru.in/'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0"
}
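
The `headers` dictionary can be attached once to a `requests.Session` so that every later request reuses it. The following is a minimal sketch, not part of the original notebook: the helper names `make_session` and `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0"
}


def make_session(headers=HEADERS):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_html(session, url, timeout=10):
    """Hypothetical helper: download a page and fail loudly on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx replies
    return response.content


# Creating the session does not contact the site; only fetch_html does.
session = make_session()
```

With this in place, a call such as `fetch_html(session, baseurl)` would replace the repeated `requests.get(..., headers = headers)` pattern, and a failed download would raise an exception instead of handing an error page to the parser.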

## *Scraping the Cafe & Restaurant Category*

In [3]:
r = requests.get('https://www.addressguru.in/Cafe-&-Restaurants/Dehradun/MTM=', headers = headers)

soupy = BeautifulSoup(r.content, 'lxml')
soup = BeautifulSoup(soupy.prettify(), 'lxml')

product_list = soup.find_all('div', class_ = 'search-top')

print(len(product_list))

23


In [20]:
product_link = []

#Pages 1 and 2: range(1, 3) stops before 3
for x in range(1,3):
    #Build the page URL with an f-string
    r = requests.get(f"https://www.addressguru.in/Cafe-&-Restaurants/Dehradun/MTM=?page={x}", headers = headers)
    
    soupy = BeautifulSoup(r.content, 'lxml')
    soup = BeautifulSoup(soupy.prettify(), 'lxml')
    
    product_list = soup.find_all('a', class_ = 'search-heading')   
    # getting urls from href; NOTE: urls is reassigned on every pass, so only the last page's links remain
    urls = [a['href'] for a in product_list]
    #print(urls)       

    # for item in product_list:
    #     for link in item.find_all('a', href = True):
    #         product_link.append(link['href'])

print(len(urls))

20
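
Because `urls` is reassigned inside the loop, only the links from the last page survive, which is consistent with the count of 20 printed above. One way to keep the links from every page is sketched below. It is an illustration under stated assumptions: the function names are hypothetical, the two HTML strings are made-up stand-ins for downloaded pages rather than real site markup, and the standard-library `html.parser` backend is used instead of `lxml` so the example has fewer dependencies.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a class="search-heading"> in one page."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", class_="search-heading", href=True)
    return [a["href"] for a in anchors]


def collect_links(pages):
    """Merge the links from several pages, keeping order and dropping repeats."""
    seen = set()
    links = []
    for html in pages:
        for href in extract_links(html):
            if href not in seen:
                seen.add(href)
                links.append(href)
    return links


# Made-up stand-ins for two result pages (not real site markup).
page_1 = '<a class="search-heading" href="https://example.com/a">A</a>'
page_2 = (
    '<a class="search-heading" href="https://example.com/b">B</a>'
    '<a class="search-heading" href="https://example.com/a">A again</a>'
    '<a class="other" href="https://example.com/c">C</a>'
)

all_links = collect_links([page_1, page_2])
print(all_links)  # ['https://example.com/a', 'https://example.com/b']
```

In the notebook, the content of each `requests.get` response would take the place of `page_1` and `page_2`.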


### **Test Page**

In [21]:
testlink = 'https://www.addressguru.in/my-wife-s-place'

r = requests.get(testlink, headers = headers)

soup1 = BeautifulSoup(r.content, 'lxml')
soup2 = BeautifulSoup(soup1.prettify(), 'lxml')

# Business name from the page heading
name = soup2.find('h1', style = "margin-top:10px;font-size:25px;").get_text().strip()

# The same span holds both the rating and the review count
rating_text = soup2.find('span', style = "font-size:16px!important;").get_text().strip()

rating = rating_text[:1]      # first character, e.g. '5'

review = rating_text[-13:]    # last 13 characters, e.g. '( 2 Reviews )'

day = datetime.date.today()

CafeAndRestaurant = {
    'name': name,
    'rating': rating,
    'review': review,
    'date': day
}

print(name, rating, review, day)

My Wife's Place 5 ( 2 Reviews ) 2023-10-01
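
The fixed-width slices used for `rating` and `review` depend on the text having exactly the shape seen in this output: a one-character rating and a single-digit review count. A regular expression is one more tolerant alternative. The sketch below assumes the span text looks like `5 ( 2 Reviews )`, possibly with extra whitespace; the function name `parse_rating_text` is hypothetical.

```python
import re


def parse_rating_text(text):
    """Split text such as '5 ( 2 Reviews )' into (rating, review_count).

    rating is None when the text does not start with a number.
    """
    rating_match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    count_match = re.search(r"\(\s*(\d+)\s*Reviews?\s*\)", text, re.IGNORECASE)
    rating = float(rating_match.group(1)) if rating_match else None
    review_count = int(count_match.group(1)) if count_match else 0
    return rating, review_count


print(parse_rating_text("5 ( 2 Reviews )"))  # (5.0, 2)
print(parse_rating_text("( 0 Reviews )"))    # (None, 0)
```

Returning `None` for a missing rating avoids storing a stray `(` for listings that have no reviews yet, and numeric values are easier to sort and aggregate later.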


## **Finalise Category Scraping**

<p>Collect the data from every product link, convert it into a <code>DataFrame</code>, and save it as a spreadsheet file.</p>
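
The conversion step could look roughly like the sketch below. The two records are copied from this notebook's own output; the choice of CSV as the spreadsheet format and the file name `cafes.csv` in the comment are assumptions, not something the notebook specifies.

```python
import datetime
import io

import pandas as pd

# Sample records shaped like the dictionaries built in this notebook.
records = [
    {"name": "My Wife's Place", "rating": "5", "review": "( 2 Reviews )",
     "date": datetime.date(2023, 10, 1)},
    {"name": "Black Pepper Restaurant", "rating": "5", "review": "( 2 Reviews )",
     "date": datetime.date(2023, 10, 1)},
]

# A list of dictionaries becomes one row per dictionary, one column per key.
df = pd.DataFrame(records)

# Write to an in-memory buffer here; to write a real file, pass a path
# instead, e.g. df.to_csv("cafes.csv", index=False) (hypothetical file name).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the automatic row numbers out of the saved file.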

In [18]:
urls

['https://www.addressguru.in/kabila-restaurant-dehradun',
 'https://www.addressguru.in/sethis-food-corner-sfc-dehradun',
 'https://www.addressguru.in/noodle-house-restaurant-n-cafe',
 'https://www.addressguru.in/y-cafe-restaurant',
 'https://www.addressguru.in/ambrosia-restaurant-dehradun',
 'https://www.addressguru.in/south-indian-vibes-dehradun',
 'https://www.addressguru.in/desi-chulha-dehradun',
 'https://www.addressguru.in/my-wife-s-place',
 'https://www.addressguru.in/black-pepper-restaurant',
 'https://www.addressguru.in/uttarakhand-food-junction',
 'https://www.addressguru.in/uss-da-dhaba-restaurant-dehradun',
 'https://www.addressguru.in/mamma-s-taste-restaurant',
 'https://www.addressguru.in/tirupati-restaurant-dehradun',
 'https://www.addressguru.in/the-northeast-hut-dehradun',
 'https://www.addressguru.in/all-day-breakfast-dehradun',
 'https://www.addressguru.in/the-punjabi-essence-dehradun',
 'https://www.addressguru.in/babagz',
 'https://www.addressguru.in/prez-restaurant

In [24]:
CafeAndRestaurant = []    # one dictionary per listing

for link in urls:
    r = requests.get(link, headers = headers)

    soup1 = BeautifulSoup(r.content, 'lxml')
    soup2 = BeautifulSoup(soup1.prettify(), 'lxml')

    name = soup2.find('h1', style = "margin-top:10px;font-size:25px;").get_text().strip()

    # NOTE: when a listing has no rating the text starts with '(', so rating becomes '('
    rating = soup2.find('span', style = "font-size:16px!important;").get_text().strip()[:1]

    review = soup2.find('span', style = "font-size:16px!important;").get_text().strip()[-13:]
    
    day = datetime.date.today()

    record = {
        'name': name,
        'rating': rating,
        'review': review,
        'date': day
    }

    CafeAndRestaurant.append(record)    # keep every record for the DataFrame step
    print(record)

{'name': 'Tirupati Restaurant Dehradun', 'rating': '4', 'review': '( 3 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'Tavern Restaurant', 'rating': '5', 'review': '( 1 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'Farm To Fork Restaurant', 'rating': '5', 'review': '( 1 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'Black Pepper Restaurant', 'rating': '5', 'review': '( 2 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'The Punjabi Essence Dehradun', 'rating': '5', 'review': '( 1 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': "My Wife's Place", 'rating': '5', 'review': '( 2 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'BabaGz', 'rating': '(', 'review': '( 0 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'Punjab Restaurant Dehradun', 'rating': '5', 'review': '( 1 Reviews )', 'date': datetime.date(2023, 10, 1)}
{'name': 'Y Cafe & Restaurant', 'rating': '5', 'review': '( 1 Reviews )', 'date': datetime.date(2023, 10, 