#### This notebook covers the essentials of using the Selenium, BeautifulSoup4 and Requests packages to automate web-based applications.

#### Prepared and created by: Grace Choo, Data Analytics Manager

#### Created on: 30 November 2023

This notebook was developed in VS Code. For details on downloading and setting up VS Code, you may refer to this guide:
https://www.youtube.com/watch?v=zulGMYg0v6U

### The main packages you will need to install are:

*selenium* - this lets Python interact with a web browser.

*beautifulsoup4* - Beautiful Soup is a library that makes it easy to scrape information from web pages.

*requests* - Requests allows you to send HTTP/1.1 requests extremely easily.

# To test whether the packages are already installed

In [6]:
import selenium
from bs4 import BeautifulSoup
import requests
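
If one of the imports above fails, the cell stops at the first error and says nothing about the remaining packages. As a minimal sketch, the standard-library function `importlib.util.find_spec` can report on each package without importing it. The helper name `check_packages` is made up for this illustration; note that `beautifulsoup4` is imported under the name `bs4`.

```python
from importlib.util import find_spec


def check_packages(import_names):
    """Return {import_name: True/False} depending on whether Python can find it."""
    return {name: find_spec(name) is not None for name in import_names}


# beautifulsoup4 is installed under that name but imported as "bs4".
for name, found in check_packages(["selenium", "bs4", "requests"]).items():
    print(f"{name}: {'installed' if found else 'missing, install it with pip'}")
```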

# 1. Selenium method

We will be using Microsoft Edge as the web browser. Hence, ensure that you have the following on your PC:
    
    1. Microsoft Edge
    2. The WebDriver for Microsoft Edge, an executable named "msedgedriver.exe".
    Download the WebDriver from https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
    Make sure that the WebDriver version matches the version of Microsoft Edge installed on your PC.
    3. Make sure that 'webdriver_loc' in the code below points to the folder where you saved the WebDriver.
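
The version check in step 2 can be sketched in a few lines. This is only a quick first check under the assumption that comparing the leading (major) number is a useful signal; it does not guarantee compatibility. The function name and the version strings below are hypothetical examples, not values read from a real installation.

```python
def same_major_version(browser_version, driver_version):
    """Compare the part before the first dot, e.g. '119' in '119.0.2151.97'."""
    browser_major = browser_version.strip().split(".")[0]
    driver_major = driver_version.strip().split(".")[0]
    return browser_major == driver_major


# Hypothetical version strings for illustration only.
print(same_major_version("119.0.2151.97", "119.0.2151.72"))  # True
print(same_major_version("119.0.2151.97", "118.0.2088.76"))  # False
```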


#### Importing Packages

In [28]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import InvalidArgumentException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException

## go to website

In [31]:
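#WebDriver location (the folder that contains msedgedriver.exe):
webdriver_loc = r"C:\Users\cjyi3\OneDrive\Documents\Python Projects\202311_Webscrapping"

#Recent Selenium 4 releases no longer accept the driver path as the first argument.
#Passing the path string directly raises:
#AttributeError: 'str' object has no attribute 'capabilities'
#Wrap the path in a Service object instead.
from selenium.webdriver.edge.service import Service

browser = webdriver.Edge(service=Service(webdriver_loc + r"\msedgedriver.exe"))

#go to the website
browser.get('https://www.google.com/')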

## find element by xpath

This tells Python which element you want to interact with.

1. Hover your mouse over the item of interest, right-click and select 'Inspect'.
2. Right-click the highlighted element and select Copy XPath.
3. Paste it into your code. You should get something like the following:

/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input

In [None]:
#This is the google search bar
browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')

In [None]:
#You can assign the element to a variable to make the code look cleaner:
SearchBar = browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')

## send_keys()
This types words or phrases into a search bar or other input field on the website. Continuing with the SearchBar element defined above:

In [None]:
SearchBar.send_keys('Zurich Insurance Malaysia')

## click()

This clicks on the specified element. Use it together with find_element and an XPath.

In [None]:
#click on the google search button
browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[3]/center/input[1]').click()

In [None]:
#You may also code it like so:
SearchButton = browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[3]/center/input[1]')
SearchButton.click()


## WebDriverWait
This asks the browser to wait up to X seconds for the element to load before proceeding to the next line of code. If the element takes longer than X seconds to load, a TimeoutException is raised.

In [None]:
#Eg. we set X = 30 seconds.
WebDriverWait(browser, 30).until(EC.visibility_of_element_located((By.XPATH, """/html/body/div[1]/div[3]/form/div[1]/div[1]/div[3]/center/input[1]""")))

## refresh()
This refreshes the browser. If the website misbehaves, you can refresh the page instead of rerunning the notebook from the top.

In [None]:
browser.refresh()

## quit()
This closes the browser and ends the WebDriver session.

In [None]:
browser.quit()
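
If an error occurs before `browser.quit()` is reached, the browser window stays open. One way to guard against that is a try/finally block, sketched below. `run_with_browser`, `create_browser` and `open_google` are hypothetical names for this illustration.

```python
def run_with_browser(make_browser, task):
    """Create a browser, run task(browser), and always quit the browser."""
    browser = make_browser()
    try:
        return task(browser)
    finally:
        browser.quit()  # runs even if task() raised an exception


# Usage (create_browser is your own function that returns the Edge browser):
# def open_google(browser):
#     browser.get("https://www.google.com/")
#     return browser.title
#
# title = run_with_browser(create_browser, open_google)
```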

# Python code Basics

## Time/Datetime Basics

## time.sleep()
This tells Python to wait for x seconds before proceeding to the next line. For example:

In [None]:
import time
time.sleep(5)
print('Waited for 5 seconds')
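
`time.sleep()` also accepts fractions of a second. The short sketch below measures the pause with `time.monotonic()` to show that the wait always lasts the full duration, whether or not the page is already ready. The half-second value is arbitrary.

```python
import time

start = time.monotonic()
time.sleep(0.5)  # pause for half a second
elapsed = time.monotonic() - start
print(f"Paused for about {elapsed:.2f} seconds")
```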

## datetime.now()
This records the current date and time.

In [None]:
#Import module
import datetime as dt

#Current datetime
CurrDateTime = dt.datetime.now()

#print out the current date time using the following format:
print('Current Date and Time (DD-MM-YYYY HH:MM:SS AM/PM): ' + CurrDateTime.strftime("%d-%m-%Y %I:%M:%S %p"))
print('Current Date only (YYYYMMDD format): ' + CurrDateTime.strftime('%Y%m%d'))
print('Current Time only (HH:MM:SS PM/AM): ' + CurrDateTime.strftime("%I:%M:%S %p"))
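
A common use of these format codes in scraping work is stamping output files so that each run gets its own file. The sketch below assumes a hypothetical helper `timestamped_name` and a made-up file prefix; a fixed datetime is used so the result is reproducible.

```python
import datetime as dt


def timestamped_name(prefix, moment, extension="csv"):
    """Build a file name such as 'postcodes_20231130_140509.csv'."""
    return f"{prefix}_{moment.strftime('%Y%m%d_%H%M%S')}.{extension}"


# In a real run you would pass dt.datetime.now() instead of a fixed value.
example = dt.datetime(2023, 11, 30, 14, 5, 9)
print(timestamped_name("postcodes", example))  # postcodes_20231130_140509.csv
```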

## try, except statement
The try...except statement first runs the code in the 'try' block. If a line in that block raises an error, Python skips the rest of the block and runs the 'except' block instead.

try:
    
    # Some Code

except:

    # Executed if error in the 'try' section

For example:

In [32]:
try:
    result = 6 // 0
    print("Yeah ! Your answer is : " + str(result))
except ZeroDivisionError:
    print("Sorry ! You are dividing by zero ")

Sorry ! You are dividing by zero 
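
The statement also supports optional `else` and `finally` blocks. As a small sketch (the function name `safe_divide` is made up for this illustration), `else` runs only when no error occurred and `finally` runs in every case:

```python
def safe_divide(a, b):
    """Return a // b, or None when b is zero."""
    try:
        result = a // b
    except ZeroDivisionError:
        print("Cannot divide by zero")
        return None
    else:
        # Runs only when the try block raised no exception.
        return result
    finally:
        # Runs in every case, which makes it a good place for clean-up code.
        print("Division attempted")


print(safe_divide(6, 3))  # 2
print(safe_divide(6, 0))  # None
```

Naming the exception (here `ZeroDivisionError`) rather than using a bare `except:` keeps unrelated errors visible.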


## if, else statement
if {test expression}:


    # Code for If

elif {test expression}:

    # Code for elif

else:

    # Code for else

In [None]:
num = 0

if num > 0:
    print("Positive number")
elif num == 0:
    print("Zero")
else:
    print("Negative number")

## For Loop (break, continue, pass)

In [None]:
for i in range(10):
    print(i)

### break

In [None]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        break # stop the loop when it reaches number 5
    print(i)

### continue

In [None]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        continue # skip print(i) for this iteration and move on to the next number
    print(i)

### pass

In [None]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        pass # does nothing; the code carries on with print(i) below
    print(i)
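
To compare the three keywords side by side, the sketch below collects the numbers that reach the end of the loop body instead of printing them. The function name `run_loop` is hypothetical.

```python
def run_loop(keyword):
    """Collect the numbers that reach the end of the loop body."""
    reached = []
    for i in range(10):
        if i == 5:
            if keyword == "break":
                break      # leave the loop entirely
            elif keyword == "continue":
                continue   # skip the rest of this iteration
            else:
                pass       # do nothing; carry on with the next line
        reached.append(i)
    return reached


print(run_loop("break"))     # [0, 1, 2, 3, 4]
print(run_loop("continue"))  # [0, 1, 2, 3, 4, 6, 7, 8, 9]
print(run_loop("pass"))      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```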

# 2. BeautifulSoup and Requests Method

In [None]:
from bs4 import BeautifulSoup
import requests

In [18]:
url = 'https://www.google.com/'
requests.get(url)

<Response [200]>

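A response code of 200 means the request succeeded. Codes from 400 upwards mean the request failed and the webpage could not be retrieved; 204 means the request succeeded but returned no content to parse. Codes to watch for:

1. 204 - No Content
2. 400 - Bad Request
3. 401 - Unauthorized
4. 404 - Not Found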
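
To look up what a status code means, the standard-library `http.HTTPStatus` enum provides the official phrase for each code. The sketch below is illustrative only: the helper name `describe_status` is made up, and it treats the 200-299 range as success. With Requests, the number itself is available as `page.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (phrase, succeeded) for an HTTP status code."""
    succeeded = 200 <= code < 300
    return HTTPStatus(code).phrase, succeeded


for code in (200, 204, 400, 401, 404):
    phrase, succeeded = describe_status(code)
    print(code, phrase, "success" if succeeded else "error")
```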

In [20]:
page = requests.get(url)
BeautifulSoup(page.text,'html')

<!DOCTYPE html>

<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
<meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
<title>Malaysia Postcode Search &amp; Lookup</title>
<script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js"></script><link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<script src="/template/js/jquery.js" type="text/javascript"></script>
<script src="/template/js/javascript.js" type="text/javascript"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="https://maxcdn.boo

In [21]:
soup = BeautifulSoup(page.text,'html')
print(soup)

<!DOCTYPE html>

<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
<meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
<title>Malaysia Postcode Search &amp; Lookup</title>
<script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js"></script><link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<script src="/template/js/jquery.js" type="text/javascript"></script>
<script src="/template/js/javascript.js" type="text/javascript"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="https://maxcdn.boo

In [22]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
   <meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
    <title>
     Malaysia Postcode Search &amp; Lookup
    </title>
    <script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js">
    </script>
    <link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
    <script src="/template/js/jquery.js" type="text/javascript">
    </script>
    <script src="/template/js/javascript.js" type="text/javascript">
    </script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7

## 2.1 Find and Find all

### 2.1.1 .find()

In [25]:
#to find the first match
soup.find('div')

<div id="fb-root"></div>
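
`.find()` returns a single tag, so its attributes can be read directly, and it returns None when nothing matches. The sketch below uses a small hand-written HTML snippet (hypothetical, not the page scraped above) and the built-in `html.parser` so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet, for illustration only.
html = '<div id="logo"><a href="/">Home</a></div><div id="nav"></div>'
soup_demo = BeautifulSoup(html, "html.parser")

first_div = soup_demo.find("div")
print(first_div["id"])              # logo
print(first_div.find("a")["href"])  # /

# find() returns None when nothing matches, so check before using the result.
table = soup_demo.find("table")
if table is None:
    print("No <table> in this snippet")
```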

### 2.1.2 .find_all()

In [None]:
#to find all matches
soup.find_all('div')

In [26]:
#to find all <div> tags that have no class attribute
soup.find_all('div',class_ = '')

[<div id="fb-root"></div>,
 <div id="header">
 <div class="navbar navbar-custom navbar-fixed-top" role="navigation">
 <div class="container">
 <div class="top_header">
 <div id="logo">
 <a href="/">
 <img alt="Malaysia Postcode Search &amp; Lookup" src="/template/img/logo.png" title="Malaysia Postcode Search &amp; Lookup"/>
 </a>
 </div>
 <div id="nav">
 <a href="/browse/">Browse Postcodes</a>
 <a href="/location/">Browse Locations</a>
 <a href="/contact/">Contact Us</a>
 </div>
 <div class="menu_top_trigger">
 <span class="menu_trigger_text">Ξ</span>
 </div>
 </div>
 </div>
 </div>
 </div>,
 <div id="logo">
 <a href="/">
 <img alt="Malaysia Postcode Search &amp; Lookup" src="/template/img/logo.png" title="Malaysia Postcode Search &amp; Lookup"/>
 </a>
 </div>,
 <div id="nav">
 <a href="/browse/">Browse Postcodes</a>
 <a href="/location/">Browse Locations</a>
 <a href="/contact/">Contact Us</a>
 </div>,
 <div id="menu_top">
 <div class="top_nav">
 <a href="/browse/">Browse Postcodes</a
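
`.find_all()` returns a list of tags, so the usual next step is to loop over it and pull out the text or attributes you need. The sketch below uses a hypothetical snippet modelled on the navigation block in the output above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the navigation block in the output above.
html = """
<div id="nav">
  <a href="/browse/">Browse Postcodes</a>
  <a href="/location/">Browse Locations</a>
</div>
<div class="top_nav"><a href="/contact/">Contact Us</a></div>
"""
demo = BeautifulSoup(html, "html.parser")

# Every <a> tag, reduced to (link text, href) pairs.
links = [(a.get_text(strip=True), a["href"]) for a in demo.find_all("a")]
print(links)

# Only the <div> tags whose class is "top_nav".
top_nav_divs = demo.find_all("div", class_="top_nav")
print(len(top_nav_divs))  # 1
```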