### Documentation of the Web Scraping Task
## By Md Ayyan - mdayyan698@gmail.com - 8130510698

1. **Task Overview**: Scrape clinic data (name, address, contact number, website) from Yellow Pages for general practice clinics in Queensland.

2. Imported `selenium`, `bs4` (`BeautifulSoup`), and `pandas`, and initialized the Edge WebDriver.

3. Opened the Yellow Pages search results page for general practice clinics in Queensland.

4. Created empty lists: `Clinic_name`, `Address`, `Website`, `Contact_number`.

5. Looped through the first 35 result pages, using `WebDriverWait` to wait for elements to load before parsing each page.

6. Extracted:
   - **Clinic Name**: From `h3` tags.
   - **Address**: From `p` tags.
   - **Contact Number**: From anchor tags with `tel:` links.
   - **Website**: From anchor tags with website links.

7. Implemented logic to navigate to the next page by clicking the next-page button, stopping the loop if the click fails.

8. **Error Handling**: Handled exceptions for unexpected window closures and timeouts.

9. Compiled the scraped data into a pandas DataFrame.

10. Used `DataFrame.to_excel()` to save the DataFrame as an Excel file named `QNS_Clinic detail.xlsx`.

11. Ensured the WebDriver closed properly in the `finally` block.
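
Steps 4 and 6 fill four parallel lists, which stay aligned only if every listing contributes exactly one value to each list. The following is a minimal sketch of an alternative that builds one record per listing. `make_record` and `to_columns` are hypothetical helper names, not part of the notebook; the sample values are copied from the cleaned output shown later in this document.

```python
# Hypothetical helpers (not in the original notebook): build one record per
# listing so the four fields can never drift out of step with each other.
FIELDS = ("Clinic_name", "Address", "Contact_number", "Website")
MISSING = "Not available"  # same placeholder the notebook uses


def make_record(name=None, address=None, contact=None, website=None):
    """Return a dict for one listing, filling absent values with MISSING."""
    values = (name, address, contact, website)
    return {
        field: (value.strip() if value and value.strip() else MISSING)
        for field, value in zip(FIELDS, values)
    }


def to_columns(records):
    """Turn a list of records into the column lists the notebook builds."""
    return {field: [record[field] for record in records] for field in FIELDS}


records = [
    make_record("Salisbury Medical Centre ", "Salisbury, QLD 4107",
                "0732771621", "https://www.salisburymedicalcentre.com.au"),
    make_record("Castle Hill Medical Centre", "Murrumba Downs, QLD 4503",
                "0738865100"),  # no website on this listing
]
columns = to_columns(records)
print(columns["Website"])
```

The resulting `columns` dict can be passed to `pd.DataFrame(columns)`; every list has the same length by construction, whichever fields a listing happens to lack.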



## Challenges

- Dynamic content loading: handled by shifting from Beautiful Soup to Selenium for loading the pages.
- Selector changes: class names on the site are auto-generated.
- Pagination: finding the exact tag for the next-page button required careful handling.
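
One way to soften the selector problem is to match a single descriptive class token instead of the full class string. The sketch below uses only the standard library's `html.parser`. `AnchorCollector` and `find_links` are hypothetical names, the sample HTML is made up for illustration, and it is an assumption that tokens such as `ButtonPhone` and `ButtonWebsite` (which appear in the class strings used in this notebook) stay stable when generated tokens such as `jss367` change.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains a token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Match on one token, ignoring the other (possibly generated) classes.
        if self.class_token in classes and href:
            self.hrefs.append(href)


def find_links(html, class_token):
    """Return the hrefs of all anchors carrying class_token."""
    collector = AnchorCollector(class_token)
    collector.feed(html)
    return collector.hrefs


# Made-up markup: "jss999" stands in for a generated class that has changed.
sample_html = """
<div class="Box__Div-sc-dws99b-0 enijwQ MuiCardContent-root">
  <a class="MuiButtonBase-root ButtonPhone wobble-call" href="tel:0747514000">Call</a>
  <a class="MuiButtonBase-root ButtonWebsite jss999" href="https://www.nbgpsc.com">Website</a>
</div>
"""
print(find_links(sample_html, "ButtonPhone"))
print(find_links(sample_html, "ButtonWebsite"))
```

With BeautifulSoup the same idea can be written as `div.find('a', class_='ButtonPhone')`, since a single value passed to `class_` matches any one of an element's classes.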

In [178]:
pip install xlsxwriter


Collecting xlsxwriter
  Downloading XlsxWriter-3.2.0-py3-none-any.whl.metadata (2.6 kB)
Downloading XlsxWriter-3.2.0-py3-none-any.whl (159 kB)
Installing collected packages: xlsxwriter
Successfully installed xlsxwriter-3.2.0
Note: you may need to restart the kernel to use updated packages.


In [134]:
pip install openpyxl


Note: you may need to restart the kernel to use updated packages.


In [1]:
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options


In [2]:
edge_options = Options()
edge_options.add_argument("start-maximized")  

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from selenium.common.exceptions import NoSuchWindowException, TimeoutException


In [4]:
service = Service("c:\\Users\\mdayy\\OneDrive\\Desktop\\edgedriver_2792win64\\msedgedriver.exe")
driver = webdriver.Edge(service=service, options=edge_options)


In [18]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from selenium.common.exceptions import NoSuchWindowException, TimeoutException

driver = webdriver.Edge()

try:
    driver.get('https://www.yellowpages.com.au/search/listings?clue=General+practice+clinics%2Fmedical+centres&locationClue=Queensland')
    
    Clinic_name = []
    Address = []
    Website = []
    Contact_number = []

    for page in range(1, 36):  # Loop through pages 1 to 35
        print(f"Scraping page {page}...")
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.TAG_NAME, "a"))
        )
        
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        
        # For Name ---------------------------------------
        div_b = soup.find_all('div', class_="Box__Div-sc-dws99b-0 dAyAhR")
        for div in div_b:
            a_tag = div.find('h3')
            if a_tag:
                Clinic_name.append(a_tag.text.strip())  
            else:
                Clinic_name.append("Not available")

        # For Address ---------------------------------------
        for div in div_b:
            a_tag = div.find('p')
            if a_tag:
                Address.append(a_tag.text.strip())  
            else:
                Address.append("Not available")

        # For Contact ---------------------------------------
        div_Contact = soup.find_all('div', class_="Box__Div-sc-dws99b-0 enijwQ MuiCardContent-root")
        for div in div_Contact:
            a_tag = div.find('a', class_='MuiButtonBase-root MuiButton-root MuiButton-text ButtonPhone wobble-call MuiButton-textPrimary MuiButton-fullWidth')
            if a_tag and a_tag.get('href'):
                Contact_number.append(a_tag['href']) 
            else:
                Contact_number.append("Not available") 

        # For Website ---------------------------------------
        div_tags = soup.find_all('div', class_="Box__Div-sc-dws99b-0 enijwQ MuiCardContent-root")
        for div in div_tags:
            a_tag = div.find('a', class_='MuiButtonBase-root MuiButton-root MuiButton-text ButtonWebsite jss367 MuiButton-textPrimary MuiButton-fullWidth')
            if a_tag and a_tag.get('href'):
                Website.append(a_tag.get('href'))  
            else:
                Website.append("Not available")

        # Navigate to the next page
        try:
            next_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.MuiButtonBase-root.MuiButton-root.MuiButton-outlined.MuiButton-fullWidth')))           
            next_button.click()
        except Exception as e:
            print("No more pages or an error occurred while trying to click next:", e)
            break

    # Compile the scraped lists into a DataFrame
    df = pd.DataFrame({
        'Clinic_name': Clinic_name,
        'Address': Address,
        'Contact_number': Contact_number,
        'Website': Website
    })   

    

except NoSuchWindowException:
    # The browser window is gone, so nothing more can be scraped in this run;
    # the finally block still shuts the driver down.
    print("The window was closed unexpectedly.")

except TimeoutException:
    print("The page took too long to load or elements were not found in the expected time.")

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    driver.quit()


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
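
The loop waits for any `a` tag before parsing, but the page that was just scraped also contains `a` tags, so that wait can return before the next page's listings have replaced the old ones. This may be one reason the output below contains repeated rows (for example rows 0 and 3), though the notebook does not establish the cause. A minimal sketch of a stricter wait follows, assuming some piece of page content (here the first clinic name) differs between consecutive pages. `wait_for_change` is a hypothetical helper, and a plain iterator stands in for the browser.

```python
import time


def wait_for_change(read_marker, old_marker, timeout=10.0, poll=0.5):
    """Poll read_marker() until it returns something other than old_marker.

    Returns the new marker, or raises TimeoutError if nothing changes in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        marker = read_marker()
        if marker != old_marker:
            return marker
        if time.monotonic() >= deadline:
            raise TimeoutError("page content did not change after clicking next")
        time.sleep(poll)


# Stand-in for a browser: the "first clinic name" changes on the third read.
reads = iter(["Clinic A", "Clinic A", "Clinic B"])
new_marker = wait_for_change(lambda: next(reads), "Clinic A", timeout=5.0, poll=0.01)
print(new_marker)
```

In the scraping loop, `read_marker` could be something like `lambda: driver.find_element(By.TAG_NAME, "h3").text`, with the old value recorded just before `next_button.click()`. Whether the first `h3` is a reliable marker on this site is an assumption that would need checking.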


In [19]:
df 

| | Clinic_name | Address | Contact_number | Website |
|---|---|---|---|---|
| 0 | Northern Beaches GP SuperClinic | Medical Centres, Deeragun, QLD 4818 | tel:0747514000 | https://www.nbgpsc.com |
| 1 | Salisbury Medical Centre | Medical Centres, Salisbury, QLD 4107 | tel:0732771621 | https://www.salisburymedicalcentre.com.au |
| 2 | Cairns Eye & Laser Centre | Medical Centres, Manoora, QLD 4870 | tel:0740537877 | https://www.cairnseye.com |
| 3 | Northern Beaches GP SuperClinic | Medical Centres, Deeragun, QLD 4818 | tel:0747514000 | https://www.nbgpsc.com |
| 4 | Chermside Medical Centre | Medical Centres, Chermside, QLD 4032 | tel:0739174200 | https://chermsidemedicalcentre.com.au |
| ... | ... | ... | ... | ... |
| 1255 | Barrier Reef Medical Centre | Medical Centres, Cairns North, QLD 4870 | tel:0740516299 | http://www.brmc.com.au |
| 1256 | Sunnybank Hills Family Practice | Medical Centres, Sunnybank Hills, QLD 4109 | tel:0733618111 | http://www.sbhfamilypractice.com.au |
| 1257 | Medicine On Second | Medical Centres, Maroochydore, QLD 4558 | tel:0754439455 | http://www.medicineonsecond.com.au |
| 1258 | Family First Medical Centre & Family First Ski... | Medical Centres, Urraween, QLD 4655 | tel:0741242466 | http://www.familyfirstmedicalcentre.com.au |


In [20]:
df['Contact_number'] = df['Contact_number'].str.replace('tel:', '', regex=False)
df['Address'] = df['Address'].str.replace('Medical Centres,', '', regex=False).str.strip()
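
`str.replace` removes the given text wherever it occurs in a value. A minimal sketch of a stricter clean-up that only strips a leading prefix and leaves the "Not available" placeholder untouched is shown below. `clean_contact` and `clean_address` are hypothetical helpers, and `str.removeprefix` requires Python 3.9 or later.

```python
MISSING = "Not available"  # placeholder used by the scraping cell


def clean_contact(value):
    """Strip a leading 'tel:' scheme; leave the placeholder untouched."""
    if value == MISSING:
        return value
    return value.removeprefix("tel:").strip()


def clean_address(value, category="Medical Centres,"):
    """Drop the leading category label and surrounding whitespace."""
    if value == MISSING:
        return value
    return value.removeprefix(category).strip()


print(clean_contact("tel:0747514000"))
print(clean_address("Medical Centres, Deeragun, QLD 4818"))
```

These functions could be applied column-wise with `df['Contact_number'].map(clean_contact)` and `df['Address'].map(clean_address)`.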



In [21]:
df

| | Clinic_name | Address | Contact_number | Website |
|---|---|---|---|---|
| 0 | Northern Beaches GP SuperClinic | Deeragun, QLD 4818 | 0747514000 | https://www.nbgpsc.com |
| 1 | Salisbury Medical Centre | Salisbury, QLD 4107 | 0732771621 | https://www.salisburymedicalcentre.com.au |
| 2 | Cairns Eye & Laser Centre | Manoora, QLD 4870 | 0740537877 | https://www.cairnseye.com |
| 3 | Northern Beaches GP SuperClinic | Deeragun, QLD 4818 | 0747514000 | https://www.nbgpsc.com |
| 4 | Chermside Medical Centre | Chermside, QLD 4032 | 0739174200 | https://chermsidemedicalcentre.com.au |
| ... | ... | ... | ... | ... |
| 1255 | Barrier Reef Medical Centre | Cairns North, QLD 4870 | 0740516299 | http://www.brmc.com.au |
| 1256 | Sunnybank Hills Family Practice | Sunnybank Hills, QLD 4109 | 0733618111 | http://www.sbhfamilypractice.com.au |
| 1257 | Medicine On Second | Maroochydore, QLD 4558 | 0754439455 | http://www.medicineonsecond.com.au |
| 1258 | Family First Medical Centre & Family First Ski... | Urraween, QLD 4655 | 0741242466 | http://www.familyfirstmedicalcentre.com.au |


# Removing duplicate rows

In [22]:
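# Drop rows that are identical in every column
df_no_duplicates = df.drop_duplicates()

# Keep only the first row for each contact number
df_no_duplicates_specific = df.drop_duplicates(subset=['Contact_number'])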
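
Dropping duplicates on `Contact_number` treats the "Not available" placeholder like any other value, so if several listings lack a phone number only the first of them is kept. The sketch below shows the intended logic in plain Python, keeping every row without a number. `drop_duplicate_contacts` is a hypothetical helper, and the two "Hypothetical Clinic" rows are invented for illustration.

```python
MISSING = "Not available"  # placeholder used by the scraping cell


def drop_duplicate_contacts(rows, key="Contact_number"):
    """Keep the first row per contact number; keep every row lacking one."""
    seen = set()
    kept = []
    for row in rows:
        contact = row[key]
        if contact == MISSING:
            kept.append(row)  # cannot tell these apart by phone, so keep all
        elif contact not in seen:
            seen.add(contact)
            kept.append(row)
    return kept


rows = [
    {"Clinic_name": "Northern Beaches GP SuperClinic", "Contact_number": "0747514000"},
    {"Clinic_name": "Hypothetical Clinic A", "Contact_number": MISSING},
    {"Clinic_name": "Northern Beaches GP SuperClinic", "Contact_number": "0747514000"},
    {"Clinic_name": "Hypothetical Clinic B", "Contact_number": MISSING},
]
unique_rows = drop_duplicate_contacts(rows)
print([row["Clinic_name"] for row in unique_rows])
```

A pandas expression along the same lines would be `df[~df.duplicated(subset=['Contact_number']) | (df['Contact_number'] == 'Not available')]`.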


In [23]:
df_no_duplicates_specific

| | Clinic_name | Address | Contact_number | Website |
|---|---|---|---|---|
| 0 | Northern Beaches GP SuperClinic | Deeragun, QLD 4818 | 0747514000 | https://www.nbgpsc.com |
| 1 | Salisbury Medical Centre | Salisbury, QLD 4107 | 0732771621 | https://www.salisburymedicalcentre.com.au |
| 2 | Cairns Eye & Laser Centre | Manoora, QLD 4870 | 0740537877 | https://www.cairnseye.com |
| 4 | Chermside Medical Centre | Chermside, QLD 4032 | 0739174200 | https://chermsidemedicalcentre.com.au |
| 5 | Belgian Gardens Medical Centre | Belgian Gardens, QLD 4810 | 0747716666 | http://www.bgmc.com.au |
| ... | ... | ... | ... | ... |
| 67 | Castle Hill Medical Centre | Murrumba Downs, QLD 4503 | 0738865100 | Not available |
| 68 | North Shore Medical Centre | Mudjimba, QLD 4564 | 0754489200 | https://nsmedical.com.au |
| 69 | Majellan Medical Centre | Scarborough, QLD 4020 | 0738801444 | http://www.redcliffedoctor.com.au |
| 70 | Redcliffe Skin Cancer Centre | Margate, QLD 4019 | 0732843030 | http://www.redcliffeskincancer.com.au |


In [24]:
# Write the de-duplicated rows and the full scraped table to separate sheets
with pd.ExcelWriter('QNS_Clinic detail.xlsx') as writer:
    df_no_duplicates_specific.to_excel(writer, sheet_name='unique', index=False)
    df.to_excel(writer, sheet_name='duplicates', index=False)
