# <span style="color:darkblue"> Lecture 6c - Basics of Web Scraping </span>


<font size = "5">

Import packages for data processing + web scraping

<font size = "3">

- You can see the packages that you are importing in the <br>
subfolder "scripts", in the file "import_packages.py"
- Writing parts of your code in external Python scripts can help <br>
your code be more concise and focused.

In [1]:
# This command executes the Python script that imports the packages
exec(open("./scripts/import_packages.py").read())


# <span style="color:darkblue"> I. Running Chrome from Python </span>

<font size = "5">

Initialize web driver

<font size = "3">

- This will open a Google Chrome window (do not close it)
- You may have to grant permission to Python to access the browser <br>
via a pop-up window
- If you choose ```options.headless = False``` a new window will appear
- If you choose ```options.headless = True``` the whole process will be hidden <br>
(you may consider this option after you've got the example automated)


In [2]:
options = webdriver.ChromeOptions()

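# True hides the navigating of the browser by the scraper,
# False shows you the tab/window opening and stuff getting clicked
# For beginners it is recommended to set this to False.
# Note: newer Selenium 4 releases dropped the "headless" attribute; there,
# hiding the browser is done with options.add_argument("--headless=new").
options.headless = False 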

driver = webdriver.Chrome(options=options)

# For advanced users:
# Once you are done you can close the browser automatically by running the 
# following commands at the end of your code
#    driver.close()
#    driver.quit()

<font size = "5">

Define URL

In [3]:
starting_url = 'https://atlas.emory.edu'

<font size = "5">

Open website

<font size = "3">

- For this to work the browser needs to be open
- This should automatically go to the website "starting_url"

In [4]:
driver.get(starting_url)

# <span style="color:darkblue"> II. Basic Navigation </span>

<font size = "5">

Search by HTML tag

<font size = "3">

-  ``` driver.find_element("xpath","//tag") ```
- The first argument, "xpath", should always be included. It names <br>
the syntax used to search for elements within the HTML code
- You can enter any tag, e.g. header, body, div, etc.
- If there are multiple occurrences, it will extract the first
- Ideally use single quotation marks ``` ' ' ```
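
The difference between extracting only the first match and extracting every match can be tried without opening a browser. The sketch below is only a stand-in, under stated assumptions: it uses Python's built-in `xml.etree.ElementTree` (which supports a small subset of XPath and expects `.//tag` instead of `//tag`) on a made-up page, not Selenium on the real website.

```python
import xml.etree.ElementTree as ET

# Made-up page used only for illustration
page = ET.fromstring(
    "<html><body>"
    "<header class='banner'><div>Logo</div><div>Login</div></header>"
    "<div>Main</div>"
    "</body></html>"
)

first_div = page.find(".//div")     # returns only the first match
all_divs = page.findall(".//div")   # returns a list with every match

print(first_div.text)
print(len(all_divs))
```

Here `find` plays the role of `find_element` and `findall` the role of `find_elements`.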

In [5]:
search_header = driver.find_element('xpath','//header') 

<font size = "5">

Extract specific attributes of the HTML tag

<font size = "3">

- Extract the value of an HTML attribute using the function ``` .get_attribute() ``` <br>
(attributes are internal values visible only in the code)
- Use ```.text``` to extract all text content in a container <br>
(this is usually what is displayed to the user)

<br>


<img src="figures/screenshot_header.png" alt="drawing" width="400"/>
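
The contrast between an attribute (stored inside the tag) and the text (shown to the user) can be sketched offline. This is an illustration under assumptions: the header below is invented to resemble the screenshot, and the built-in `xml.etree.ElementTree` stands in for Selenium, so `.get()` plays the role of `.get_attribute()` and joining `.itertext()` approximates `.text`.

```python
import xml.etree.ElementTree as ET

# Invented header, loosely modelled on the screenshot
header = ET.fromstring(
    "<header class='banner' role='banner'>"
    "<h1>COURSE ATLAS</h1><a href='#'>Login</a>"
    "</header>"
)

header_class = header.get("class")   # value stored inside the tag
header_id = header.get("id")         # attribute is missing, so this is None
header_text = " ".join(
    piece.strip() for piece in header.itertext() if piece.strip()
)

print("Header class: " + header_class)
print("Header id: " + str(header_id))   # str() avoids an error when the value is None
print("Header text: " + header_text)
```

Selenium's `.get_attribute()` also returns `None` when the attribute does not exist, so wrapping the value in `str()` before concatenating is a cautious habit.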

In [6]:

header_class = search_header.get_attribute("class")
print("Header class: "  + header_class)

header_role  = search_header.get_attribute("role")
print("Header role: " + header_role)

header_text  = search_header.text
print("Header text: " + header_text)

Header class: banner
Header role: banner
Header text: COURSE ATLAS
Login


<font size = "5">

Browse extracted element

<font size = "3">

- First we extract all the HTML code in the container using <br>
```.get_attribute("outerHTML") ``` <br>
- We then use the function ```BeautifulSoup``` to parse it into a nicer format
- You can print it to screen, although you can also print it to a file, <br>
in case the output is too long. An external file might be easier <br>
to read. Open it from the file explorer in VS-Code
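
The file-writing step can be practiced on its own before combining it with scraping. The sketch below is a minimal example under assumptions: the HTML string and the file name `diagnose_example.html` are made up, and a temporary folder stands in for the `html_files` folder.

```python
import os
import tempfile

# Made-up HTML text, standing in for the parsed output
html_text = "<header>\n <h1>\n  COURSE ATLAS\n </h1>\n</header>"

# A temporary folder stands in for the "html_files" folder
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "diagnose_example.html")

# "w" opens the file for writing (it is created or overwritten)
with open(file_path, "w") as file:
    file.write(html_text)

# "r" opens the same file for reading
with open(file_path, "r") as file:
    saved_text = file.read()

print(saved_text == html_text)
```

Note that `open(..., "w")` creates the file but not the folder. If the folder is missing, running `os.makedirs("html_files", exist_ok=True)` first avoids a `FileNotFoundError`.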


In [7]:
# Extract HTML code and parse to a nicer format using "Beautiful Soup"
html_code  = search_header.get_attribute("outerHTML")
parse_code = BeautifulSoup(html_code,"html.parser").prettify()

# Uncomment this line to print on screen
# print(parse_code)

# You can also save this to a file to make it easier to read without cluttering
# the jupyter notebook.
# The function open() takes two arguments: 
# (i) the name of the file
# (ii) an option, "r" stands for read, "w" stands for write
# After we've opened the file we can print the output using ".write()"

with open('html_files/diagnose_scraping_outer.html', 'w') as file: 
    file.write(parse_code)

In [8]:
# You can also extract the inside of the container, without the
# header tag.
inner_html_code  = search_header.get_attribute("innerHTML")
parse_inner_code = BeautifulSoup(inner_html_code,"html.parser").prettify()

with open('html_files/diagnose_scraping_inner.html', 'w') as file: 
    file.write(parse_inner_code)

<font size = "5">

Extract subelements by path

<font size = "3">

- Use ```.find_elements()``` (with a plural "s") to find multiple elements
- Browse by path using a slash symbol ```/```
- You can search the absolute path from driver, or a relative path <br>
given elements that you have already extracted.
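
How the starting point changes the result can be seen on a small made-up page. This is a sketch under assumptions: it uses the built-in `xml.etree.ElementTree` rather than Selenium, and the page structure is invented.

```python
import xml.etree.ElementTree as ET

# Invented page: the header holds a nested div, the body holds one more div
page = ET.fromstring(
    "<html><body>"
    "<header><div><div>Logo</div></div><div>Login</div></header>"
    "<div>Main</div>"
    "</body></html>"
)
header = page.find(".//header")

children = header.findall("div")         # direct children of the header only
descendants = header.findall(".//div")   # any depth, but still inside the header
whole_page = page.findall(".//div")      # any depth, anywhere in the page

print(len(children), len(descendants), len(whole_page))
```

In the full XPath syntax that Selenium uses, a path starting with `//` begins at the document root even when it is called from an extracted element, so `.//div` is the form that stays inside that element.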



In [9]:
# Browse by absolute path
search_subdivs_absolute = driver.find_elements('xpath','//header/div') 

# Browse by relative path, starting from the extracted header
search_subdivs_relative = search_header.find_elements('xpath','div') 


<font size = "5">

Count how many elements are extracted

<font size = "3">

- Often searches can produce more than one result
- This happens when multiple HTML elements share the same tag

In [10]:
num_elements = len(search_subdivs_relative)
print(num_elements)

2


<font size = "5">

Browse first lines of extracted elements

<font size = "3">

- Run a "for-loop" over all extracted elements
- We extract elements using square brackets and the index [i]
- We use ```.splitlines()``` to split the HTML code
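
The loop, the index `[i]` and `.splitlines()` can be tried on plain strings first. This is a minimal sketch: the HTML snippets are made up, and the last one is deliberately empty to show a case worth guarding against.

```python
# Made-up HTML snippets; the last one is empty on purpose
html_snippets = [
    "<a class='logout'>\n Logout\n</a>",
    "<a class='login'>\n Login\n</a>",
    "",
]

first_lines = []
for i in range(len(html_snippets)):
    lines = html_snippets[i].splitlines()
    # An empty string splits into an empty list, so lines[0] would fail
    first_lines.append(lines[0] if lines else "(empty)")

print(first_lines)
```

The same guard may be useful when an extracted element has no inner HTML.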

In [11]:
# The syntax of a loop is
# for element in list_elements:
#   Body of code
# Here range(num_elements) produces the integers from 0 to (num_elements - 1).

for i in range(num_elements):
    
    # Extract HTML for element "i"
    html_code  = search_subdivs_relative[i].get_attribute("innerHTML")
    # Parse HTML code    
    parse_code = BeautifulSoup(html_code,"html.parser").prettify()
    # Split HTML into multiple lines and print the first one, in position [0].
    print(parse_code.splitlines()[0])


<a class="header-icon logout-btn" href="#" onclick="return sam.logout(event)">
<a class="anon-only" data-action="login" href="#">


<font size = "5">

Try it yourself!

<font size = "3">

- Use ```.find_elements()``` to find all elements with the tag ```div```
- Count how many elements are found.

    Note: ```div``` is a very common tag in websites and you will likely <br>
    find many elements. This can motivate the targeted options we will see in the <br>
    next section



In [12]:
# Write your own code

# Search the whole document: a path that starts with "//" begins at the
# document root, even when it is called from an extracted element
search_all_divs = driver.find_elements('xpath','//div') 
num_elements = len(search_all_divs)
print(num_elements)

54


<font size = "5">

Try it yourself!

<font size = "3">

- Store the first HTML line of each element into an external file <br>
that you should call ```diagnose_div.html``` 
- HINT: Embed the code chunk in "Browse first lines of extracted elements" <br>
into code using the ```with open()``` function. Change print to <br>
```file.write()```




In [17]:
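# Write your own code
# Write the first HTML line of every div to the file ("outerHTML" keeps the div tag)
search_divs = driver.find_elements('xpath','//div')
with open('html_files/diagnose_div.html', 'w') as file: 
    for element in search_divs:
        html_code  = element.get_attribute("outerHTML")
        parse_code = BeautifulSoup(html_code,"html.parser").prettify()
        file.write(parse_code.splitlines()[0] + "\n")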



# <span style="color:darkblue"> III. Targeted Navigation </span>

<font size = "5">

Sometimes searches return too many values

<font size = "3">

- It is useful to do targeted navigation to get exactly what we need
- For better results, browse your intended website in Google Chrome <br>
using Developer Tools, and find identifiable tag + attribute combinations

<font size = "5">

Search by tag + attribute value

<font size = "3">

- Use syntax ``` '//tag[@attribute_name = "attribute_value"]' ```
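
One detail of this syntax is that the attribute value must match the whole attribute string exactly. The sketch below illustrates this under assumptions: the page is made up, and the built-in `xml.etree.ElementTree` stands in for Selenium (it expects `.//` at the start and is safest without spaces around `=`).

```python
import xml.etree.ElementTree as ET

# Made-up page: the second div carries an extra class name
page = ET.fromstring(
    "<body>"
    "<div class='banner__auth'>A</div>"
    "<div class='banner__auth wide'>B</div>"
    "<div class='other'>C</div>"
    "</body>"
)

matches = page.findall(".//div[@class='banner__auth']")
print([m.text for m in matches])
```

Only the first div matches, because the second one has the class string `banner__auth wide`. A search that returns zero elements may therefore call for checking the exact attribute value in Developer Tools.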


In [14]:
search_div = driver.find_elements('xpath','//div[@class = "banner__auth"]') 

# Count elements found
print(len(search_div))

1


<font size = "5">

Search by attribute for any tag name

<font size = "3">

- Use syntax ``` '//*[@attribute_name = "attribute_value"]' ```
- The star  * indicates that the search should match any tag name.
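
The effect of the star can be compared with a tag-specific search on a small made-up page. As before, this is only a sketch under assumptions: the built-in `xml.etree.ElementTree` stands in for Selenium and the page is invented.

```python
import xml.etree.ElementTree as ET

# Made-up page: a div and a span share the same class
page = ET.fromstring(
    "<body>"
    "<div class='banner__auth'>A</div>"
    "<span class='banner__auth'>B</span>"
    "<div class='other'>C</div>"
    "</body>"
)

only_divs = page.findall(".//div[@class='banner__auth']")   # tag must be div
any_tag = page.findall(".//*[@class='banner__auth']")       # any tag name

print([e.tag for e in only_divs])
print([e.tag for e in any_tag])
```

The wildcard search picks up the span as well, which helps when the attribute value is known but the tag is not.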


In [15]:
search_div = driver.find_elements('xpath','//*[@class = "banner__auth"]') 

# Count elements 
print(len(search_div))

1


<font size = "5">

Try it yourself!

<font size = "3">

- Search for all ```div``` with the ```class``` attribute equal to  <br>
``` "form_control" ```
- Store the first line of these elements to a file for diagnostics.


In [18]:
#  Write your own Code

search_div = driver.find_elements('xpath','//div[@class = "form_control"]') 

# Count elements 
print(len(search_div))



0
