#### This notebook covers the essentials of using the Selenium, BeautifulSoup4 and Requests packages to automate web-based applications.

#### Prepared and created by: Grace Choo, Data Analytics Manager

#### Created on: 30 November 2023

This notebook was developed in VS Code. For information on downloading and setting up VS Code, you may refer to this guide:

https://www.youtube.com/watch?v=zulGMYg0v6U

### The main packages you will need to install are:

*selenium* - lets Python interact with a web browser.

*beautifulsoup4* - Beautiful Soup is a library that makes it easy to scrape information from web pages.

*requests* - Requests allows you to send HTTP/1.1 requests extremely easily.

# To test whether the packages are already installed

In [1]:
import selenium
from bs4 import BeautifulSoup
import requests
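
If one of the imports above fails with ModuleNotFoundError, that package is not installed. The following is a minimal sketch of a gentler check, assuming Python 3.8 or newer (for `importlib.metadata`). The helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names as used by pip (beautifulsoup4 is imported as bs4)
for name in ("selenium", "beautifulsoup4", "requests"):
    version = installed_version(name)
    if version is None:
        print(name, "-> not installed, try: pip install " + name)
    else:
        print(name, "->", version)
```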

# Selenium method

We will be using Microsoft Edge as the web browser. Hence, ensure that you have the following on your PC:
    
    1. Microsoft Edge
    2. The webdriver for Microsoft Edge, a file named "msedgedriver.exe".
    Download the webdriver from https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
    Make sure that the webdriver version matches the version of Microsoft Edge installed on your PC.
    3. Make sure that 'webdriver_loc' in the code below points to the folder where you saved the webdriver.

## Importing Packages

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import InvalidArgumentException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException

## go to website

In [23]:
#Webdriver location (the folder that contains msedgedriver.exe):
webdriver_loc = r"C:\Users\Grace.Choo\OneDrive - Zurich APAC\Documents - MY Analytics\Admin\Training\Python Selenium Training"

#Selenium 4 takes the driver path through a Service object (passing a bare path is deprecated)
from selenium.webdriver.edge.service import Service
browser = webdriver.Edge(service=Service(webdriver_loc + r"\msedgedriver.exe"))

#go to the website
browser.get('https://www.google.com/')


## find element by xpath

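This tells Python which element you want to interact with.

1. Hover your mouse over the item of interest, right-click and select 'Inspect'.
2. Right-click the highlighted element in the developer tools and select Copy > Copy XPath. This usually gives a short path such as //*[@id="APjFqb"], while Copy full XPath gives the complete path from the top of the page.
3. Paste the result into your code. A full XPath looks like this: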

/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input

In [6]:
#This is the google search bar
browser.find_element(By.XPATH,'//*[@id="APjFqb"]')

<selenium.webdriver.remote.webelement.WebElement (session="7382f612c97019abfc8769052d21bb5f", element="26EDEEC7E0CD4B2D27846770E0D76043_element_2")>

In [27]:
#You can assign the element to a variable (to make the code look cleaner):
SearchBar = browser.find_element(By.XPATH,'//*[@id="APjFqb"]')
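
An id-based XPath tends to survive page changes better than a full positional path. The sketch below illustrates the idea without a browser, using Python's built-in `xml.etree.ElementTree` (which supports only a small XPath subset) on two made-up page snippets. The snippets, the `locate` helper and the "redesign" scenario are assumptions for illustration, not Google's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, version 1
PAGE_V1 = """
<html><body>
  <div><form><input id="APjFqb" name="q"/></form></div>
</body></html>
"""

# Same page after a hypothetical redesign: a banner <div> is inserted first
PAGE_V2 = """
<html><body>
  <div>banner</div>
  <div><form><input id="APjFqb" name="q"/></form></div>
</body></html>
"""

POSITIONAL_PATH = "body/div[1]/form/input"  # depends on the exact layout
ID_BASED_PATH = ".//input[@id='APjFqb']"    # depends only on the id


def locate(page_source, path):
    """Return the first element matching path, or None if nothing matches."""
    root = ET.fromstring(page_source)
    return root.find(path)


positional_v1 = locate(PAGE_V1, POSITIONAL_PATH)
positional_v2 = locate(PAGE_V2, POSITIONAL_PATH)
id_based_v2 = locate(PAGE_V2, ID_BASED_PATH)

print("positional path on v1 found:", positional_v1 is not None)
print("positional path on v2 found:", positional_v2 is not None)
print("id-based path on v2 found:", id_based_v2 is not None)
```

In the second version the positional path points at the banner, so nothing is found, while the id-based path still locates the input.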

## send_keys()
This types words/phrases into a search bar or input field on the website.

In [32]:
SearchBar.send_keys('Zurich Insurance Malaysia')

## click()
This clicks on the specified element. Use it together with find_element and an XPath, which locate the element first.

In [34]:
#click on the google search button
browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]').click()

In [31]:
#You may also code it like so:
SearchButton = browser.find_element(By.XPATH,'/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]')
SearchButton.click()

## .text
This extracts the text of that particular element.

In [35]:
ResultStats = browser.find_element(By.XPATH,'//*[@id="result-stats"]')
ResultStats.text

'About 8,530,000 results (0.33 seconds) '

In [36]:
#You may assign a variable to the text
Result = ResultStats.text
print(Result)

About 8,530,000 results (0.33 seconds) 
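
The extracted text is a plain string, so the number inside it has to be parsed before it can be compared or summed. A minimal sketch using a regular expression follows; the helper name `parse_result_count` and the variable `sample_text` are illustrative, and the sample string is copied from the output above.

```python
import re


def parse_result_count(stats_text):
    """Return the first number in the text as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", stats_text)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group().replace(",", ""))


sample_text = 'About 8,530,000 results (0.33 seconds) '
print(parse_result_count(sample_text))  # 8530000
```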


## WebDriverWait
This asks the browser to wait up to X seconds for the element to load before proceeding to the next line of code. As soon as the element is visible, the code continues; if loading takes longer than X seconds, a TimeoutException is raised.

In [None]:
#Eg. we set X = 30 seconds.
WebDriverWait(browser, 30).until(EC.visibility_of_element_located((By.XPATH, """/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]""")))
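
Conceptually, the wait keeps checking a condition at short intervals until it succeeds or the time limit passes. The sketch below shows that polling idea in plain Python. It is a simplified illustration, not Selenium's implementation; `wait_until` and `fake_element_ready` are made-up names, and the built-in TimeoutError stands in for Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for an element that only becomes available on the third check
attempts = []


def fake_element_ready():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None


found = wait_until(fake_element_ready, timeout=5, poll_interval=0.1)
print(found, "after", len(attempts), "checks")
```

Because the wait returns as soon as the condition is met, a generous limit such as 30 seconds only costs time when the page is actually slow.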

## refresh()
This refreshes the browser. If the website has issues and you want to start over, you can refresh the page without having to rerun the code from the top.

In [18]:
browser.refresh()

## quit()
This closes the browser and ends the webdriver session.

In [19]:
browser.quit()

# Python code Basics

## Import Files

### Import Excel Files

In [22]:
import pandas as pd

df = pd.read_excel(r'Path where the Excel file is stored\File name.xlsx', sheet_name='your Excel sheet name')
df.head()

### Import .csv or .txt files

In [None]:
import pandas as pd
filename=r"{yourfilepath and filename}"
df = pd.read_csv(filename, sep='^') #sep = separator/delimiter (here the caret character ^)
df.head()
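
To see what the `sep` argument does without needing a file on disk, the sketch below reads caret-delimited text from memory. The column names and values are invented sample data.

```python
import io

import pandas as pd

# Invented sample content of a '^'-delimited text file
raw_text = "PolicyNo^Name^Premium\nP001^Ali^120.50\nP002^Mei Ling^98.00\n"

# io.StringIO lets read_csv treat the string like an open file
df_demo = pd.read_csv(io.StringIO(raw_text), sep="^")
print(df_demo)
```

Each `^` marks a column boundary, so the first line becomes the three column headers and each following line becomes one row.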

## Export Files

### Export as Excel

In [None]:
filename=r"{yourfilename}.xlsx"
df.to_excel(filename, index=False, header=True)

## Create Pandas Dataframe

In [39]:
import pandas as pd

# create a dictionary with some data
data = {'COL1': ['ABC', 'CDE', 'EFG', 'HIJ'],
        'NUM1': [25, 30, 21, 29],
        'country': ['USA', 'Canada', 'Australia', 'UK']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)
df

   COL1  NUM1    country
0   ABC    25        USA
1   CDE    30     Canada
2   EFG    21  Australia
3   HIJ    29         UK


In [49]:
#Create pandas dataframe with only 1 row
import pandas as pd
df = pd.DataFrame({'COL1': 'ABC',
                   'NUM1':25,
                   'Country': 'USA'}, index=[0])
df

   COL1  NUM1  Country
0   ABC    25      USA


In [50]:
#You can put the items you've extracted into a DataFrame
#For example:

#let TestCaseNum = 1.
TestCaseNum = 1

#Import module and getting current date and time
import datetime as dt
CurrDateTime = dt.datetime.now()

import pandas as pd
df = pd.DataFrame({'TestCase': TestCaseNum,
        'Result From Web': Result,
        'DateTime': CurrDateTime}, index=[0])
df

   TestCase                         Result From Web                    DateTime
0         1  About 8,530,000 results (0.33 seconds)  2023-12-01 16:41:24.912364


## Storing Results in a list

In [51]:
ResultList = []
ResultList.append(df)
ResultList

[   TestCase                          Result From Web  \
 0         1  About 8,530,000 results (0.33 seconds)    
 
                     DateTime  
 0 2023-12-01 16:41:24.912364  ]

## Transform List into DataFrame

In [52]:
df2 = pd.concat(ResultList)
df2

   TestCase                         Result From Web                    DateTime
0         1  About 8,530,000 results (0.33 seconds)  2023-12-01 16:41:24.912364


## Time/Datetime Basics

### time.sleep()
This tells Python to wait for x seconds before proceeding to the next line. For example:

In [67]:
import time
time.sleep(5)
print('Wait for 5 seconds')

Wait for 5 seconds


### datetime.now()
To record the current date time.

In [66]:
#Import module
import datetime as dt

#Current datetime
CurrDateTime = dt.datetime.now()

#print out the current date and time using the following formats:
print('Current Date and Time (DD-MM-YYYY HH:MM:SS PM/AM): ' + CurrDateTime.strftime("%d-%m-%Y %I:%M:%S %p"))
print('Current Date only (YYYYMMDD format): ' + CurrDateTime.strftime('%Y%m%d'))
print('Current Time only (HH:MM:SS PM/AM): ' + CurrDateTime.strftime("%I:%M:%S %p"))

Current Date and Time (DD-MM-YYYY HH:MM:SS PM/AM): 01-12-2023 04:50:50 PM
Current Date only (YYYYMMDD format): 20231201
Current Time only (HH:MM:SS PM/AM): 04:50:50 PM
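
A common use of these format codes is stamping exported files so that each run writes to a new file name. A small sketch follows; the helper `timestamped_filename`, the "TestResults" prefix and the fixed date are illustrative only.

```python
import datetime as dt


def timestamped_filename(prefix, moment, extension=".xlsx"):
    """Build a name such as prefix_YYYYMMDD_HHMMSS.xlsx from a datetime."""
    return f"{prefix}_{moment.strftime('%Y%m%d_%H%M%S')}{extension}"


# A fixed datetime is used here so the result is predictable;
# in a real run you would pass dt.datetime.now()
fixed_moment = dt.datetime(2023, 12, 1, 16, 50, 50)
print(timestamped_filename("TestResults", fixed_moment))  # TestResults_20231201_165050.xlsx
```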


## try, except statement
The Python try…except statement runs the code under “try”. If a line in that block raises an error, the rest of the “try” block is skipped and the code under “except” runs, instead of the whole program stopping.

try:
    
    # Some Code

except:

    # Executed if error in the 'try' section

For example:

In [65]:
try:
    result = 6 // 0
    print("Yeah ! Your answer is : " + str(result))
    
except ZeroDivisionError:
    
    print("Sorry ! You are dividing by zero ")

Sorry ! You are dividing by zero 
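
In an automation loop, try/except is often used to record a failure and carry on with the next item rather than stopping the whole run. The sketch below shows this with plain Python values; in a Selenium script the same pattern would typically catch the exceptions imported earlier, such as TimeoutException or NoSuchElementException. The names `to_number`, `raw_values` and `outcomes` are made up for the example.

```python
def to_number(text):
    """Convert text such as '8,530,000' to an int; raises ValueError if not numeric."""
    return int(text.replace(",", ""))


raw_values = ["8,530,000", "not available", "42"]
outcomes = []

for raw in raw_values:
    try:
        number = to_number(raw)
    except ValueError as error:
        # Record the failure and move on to the next value
        outcomes.append((raw, "FAILED", str(error)))
    else:
        # The else branch runs only when no error was raised
        outcomes.append((raw, "OK", number))

for outcome in outcomes:
    print(outcome)
```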


## if, else statement
if {test expression}:


    # Code for If

elif {test expression}:

    # Code for elif

else:

    # Code for else

In [64]:
num = 0

if num > 0:
    print("Positive number")
elif num == 0:
    print("Zero")
else:
    print("Negative number")

Zero


## For Loop (break, continue, pass)

In [63]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


### break

In [62]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        break #to stop the loop if reaches number 5.
    print(i)

0
1
2
3
4
This is number 5


### continue

In [61]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        continue # skip the rest of this iteration (print(i) does not run for 5) and go to the next number
    print(i)

0
1
2
3
4
This is number 5
6
7
8
9


### pass

In [60]:
for i in range(10):
    if i == 5:
        print('This is number 5')
        pass # does nothing; the code below still runs, so 5 is printed as well
    print(i)

0
1
2
3
4
This is number 5
5
6
7
8
9


# Example of how this can be done.
Combining everything we have learned so far:

In [69]:
import datetime as dt
import pandas as pd
import time

ResultList = []

for i in range(5):
    TestCaseNum = i
    CurrDateTime = dt.datetime.now()

    df = pd.DataFrame({'TestCase': TestCaseNum,
                       'Result From Web': Result,
                       'DateTime': CurrDateTime}, index=[0])
    
    time.sleep(0.5)
    ResultList.append(df)

df2 = pd.concat(ResultList)
df2

   TestCase                         Result From Web                    DateTime
0         0  About 8,530,000 results (0.33 seconds)  2023-12-01 16:51:59.170673
0         1  About 8,530,000 results (0.33 seconds)  2023-12-01 16:51:59.677800
0         2  About 8,530,000 results (0.33 seconds)  2023-12-01 16:52:00.184085
0         3  About 8,530,000 results (0.33 seconds)  2023-12-01 16:52:00.687048
0         4  About 8,530,000 results (0.33 seconds)  2023-12-01 16:52:01.189504
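
Every row in the output above has index 0, because each one-row DataFrame was created with index=[0]. Two possible ways to get a running index are sketched below: pass `ignore_index=True` to `pd.concat`, or collect plain dictionaries and build the DataFrame once at the end. The placeholder result text stands in for the value scraped from the web page.

```python
import datetime as dt

import pandas as pd

# Option 1: keep the list of one-row DataFrames, renumber when combining
frames = [pd.DataFrame({"TestCase": n}, index=[0]) for n in range(3)]
combined = pd.concat(frames, ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]

# Option 2: collect one dictionary per test case, build the DataFrame once
rows = []
for test_case_num in range(3):
    rows.append({
        "TestCase": test_case_num,
        "Result From Web": "placeholder result text",  # stand-in for scraped text
        "DateTime": dt.datetime.now(),
    })

results = pd.DataFrame(rows)
print(results.index.tolist())  # [0, 1, 2]
```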


## .tolist()
This converts one of the DataFrame columns into a list, which is useful for looping.
The examples below use the df2 data from above.

In [73]:
NewList = df2['TestCase'].tolist()
for i in NewList:
    print(i)

0
1
2
3
4


In [74]:
NewList2 = df2['DateTime'].tolist()
for i in NewList2:
    print(i)

2023-12-01 16:51:59.170673
2023-12-01 16:51:59.677800
2023-12-01 16:52:00.184085
2023-12-01 16:52:00.687048
2023-12-01 16:52:01.189504


# BeautifulSoup and Requests Method

In [68]:
from bs4 import BeautifulSoup
import requests

In [18]:
url = 'https://www.google.com/'
requests.get(url)

<Response [200]>

If the response has one of the following codes, the request did not return a usable page:

1. 204 (No Content: the request succeeded but the response has no body)
2. 400 (Bad Request)
3. 401 (Unauthorized)
4. 404 (Not Found)
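
The meaning of each code can be looked up with Python's built-in `http` module. The sketch below is one way to turn a code into a readable label; with Requests, the integer itself is available as `page.status_code`. The helper name `describe_status` is illustrative.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label and whether the code is in the 2xx success range."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase}", 200 <= code < 300


for code in (200, 204, 400, 401, 404):
    label, is_success = describe_status(code)
    print(label, "-> success" if is_success else "-> error")
```

Note that 204 falls in the success range: the server accepted the request but sent back no content to parse.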

In [20]:
page = requests.get(url)
BeautifulSoup(page.text,'html')

<!DOCTYPE html>

<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
<meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
<title>Malaysia Postcode Search &amp; Lookup</title>
<script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js"></script><link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<script src="/template/js/jquery.js" type="text/javascript"></script>
<script src="/template/js/javascript.js" type="text/javascript"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="https://maxcdn.boo

In [21]:
soup = BeautifulSoup(page.text,'html')
print(soup)

<!DOCTYPE html>

<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
<meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
<title>Malaysia Postcode Search &amp; Lookup</title>
<script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js"></script><link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<script src="/template/js/jquery.js" type="text/javascript"></script>
<script src="/template/js/javascript.js" type="text/javascript"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="https://maxcdn.boo

In [22]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Comprehensive Listing of Malaysia Postcode with Search and Look-up Functions" name="description">
   <meta content="Postcode, Search, Look-up, Malaysia, Poskod, GPS, Latitude, Longitude, Coordinates" name="keywords">
    <title>
     Malaysia Postcode Search &amp; Lookup
    </title>
    <script src="/cdn-cgi/apps/head/lh0g369DraHnS4mQINJRU7qH6Ok.js">
    </script>
    <link href="/template/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
    <script src="/template/js/jquery.js" type="text/javascript">
    </script>
    <script src="/template/js/javascript.js" type="text/javascript">
    </script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7

## Find and Find all

### .find()

In [25]:
#to find the first match
soup.find('div')

<div id="fb-root"></div>

### .find_all()

In [None]:
#to find all matches
soup.find_all('div')
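
Once the right elements are found, their text and attributes can be read directly. The sketch below combines `.find()` and `.find_all()` on a small HTML fragment modelled on the page's navigation block, so it runs without a network connection. The variable names are illustrative, and the parser is named explicitly as 'html.parser' (Python's built-in parser) as an assumption for this example.

```python
from bs4 import BeautifulSoup

html_fragment = """
<div id="nav">
  <a href="/browse/">Browse Postcodes</a>
  <a href="/location/">Browse Locations</a>
  <a href="/contact/">Contact Us</a>
</div>
"""

fragment_soup = BeautifulSoup(html_fragment, "html.parser")

# .find() returns the first matching tag, here the div whose id is "nav"
nav = fragment_soup.find("div", id="nav")

# .find_all() returns every matching tag inside that div
links = [(a.get_text(strip=True), a["href"]) for a in nav.find_all("a")]
for text, href in links:
    print(text, "->", href)
```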

In [26]:
#to find all matches
soup.find_all('div',class_ = '')

[<div id="fb-root"></div>,
 <div id="header">
 <div class="navbar navbar-custom navbar-fixed-top" role="navigation">
 <div class="container">
 <div class="top_header">
 <div id="logo">
 <a href="/">
 <img alt="Malaysia Postcode Search &amp; Lookup" src="/template/img/logo.png" title="Malaysia Postcode Search &amp; Lookup"/>
 </a>
 </div>
 <div id="nav">
 <a href="/browse/">Browse Postcodes</a>
 <a href="/location/">Browse Locations</a>
 <a href="/contact/">Contact Us</a>
 </div>
 <div class="menu_top_trigger">
 <span class="menu_trigger_text">Ξ</span>
 </div>
 </div>
 </div>
 </div>
 </div>,
 <div id="logo">
 <a href="/">
 <img alt="Malaysia Postcode Search &amp; Lookup" src="/template/img/logo.png" title="Malaysia Postcode Search &amp; Lookup"/>
 </a>
 </div>,
 <div id="nav">
 <a href="/browse/">Browse Postcodes</a>
 <a href="/location/">Browse Locations</a>
 <a href="/contact/">Contact Us</a>
 </div>,
 <div id="menu_top">
 <div class="top_nav">
 <a href="/browse/">Browse Postcodes</a