# <span style="color:darkblue"> Lecture 6d - Website Interaction </span>


<font size = "5">

Import packages for data processing + web scraping


In [1]:
# Run the script that imports the packages used in this notebook
exec(open("./scripts/import_packages.py").read())


# <span style="color:darkblue"> I. Initialize Web Driver </span>

In [2]:
# Open browser to start web scraping
options = webdriver.ChromeOptions()
options.headless = False 
driver = webdriver.Chrome(options=options)

# Navigate to specific website
starting_url = 'https://atlas.emory.edu'
driver.get(starting_url)

# <span style="color:darkblue"> II. Interact with forms </span>

<font size = "5">

Emory's Course Atlas website has a searchable form <br>


<font size = "5">


<img src="figures/screenshot_courseatlas_form.png" alt="drawing" width="200"/>


<font size = "5">

Find a form element

<font size = "3">

- Here we search by tag and attribute value
- You can find the "xpath" by going into Developer Tools in Google Chrome

In [3]:
form_semester = driver.find_element('xpath','//select[@id="crit-srcdb"]')
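
<font size = "3">

To see how a "tag + attribute" search works without opening a browser, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` on a small, made-up HTML snippet. The snippet is an assumption for illustration, not the real Course Atlas page source; only the two `id` values come from this lecture. ElementTree supports just a subset of XPath, so the paths start with `.//` here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search form (not the real page)
sample_html = """
<form>
  <select id="crit-srcdb"><option value="5241">Spring 2024</option></select>
  <input id="crit-keyword" type="text" />
</form>
"""
root = ET.fromstring(sample_html.strip())

# "select" is the tag; [@id='crit-srcdb'] keeps only the element with that id
semester = root.find(".//select[@id='crit-srcdb']")
print(semester.tag, semester.attrib)

# "*" matches any tag, so only the attribute condition is used
keyword = root.find(".//*[@id='crit-keyword']")
print(keyword.tag)
```

The same two ingredients (a tag name or `*`, plus an attribute condition in square brackets) appear in the Selenium "xpath" strings used throughout this lecture.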


<font size = "5">

Convert to form

<img src="figures/screenshot_semester_form.png" alt="drawing" width="200"/>

<font size = "3">

- Must convert the HTML element using the ```Select()``` function to <br>
enable interactive form capabilities
- Extract the displayed option text with ```.text``` for each element <br>
or the internal HTML values with ```.get_attribute("value")```
- We extract the options using a loop

In [4]:
# "Select" is a class from the selenium package
# ".options" returns the list of option elements in the drop-down
select_semester = Select(form_semester)
store_options   = select_semester.options
num_options     = len(store_options)

# Store options into a list
list_options = []
for i in range(num_options):
    # This extracts the displayed text
    option_text = store_options[i].text    
    
    # This extracts the internal HTML values
    # Uncomment this line to display option values instead of displayed text
    # option_text = store_options[i].get_attribute("value")
    list_options.append( option_text )   

list_options

['Fall 2024',
 'Summer 2024',
 'Spring 2024',
 'Fall 2023',
 'Summer 2023',
 'Spring 2023',
 'Fall 2022',
 'Summer 2022',
 'Spring 2022',
 'Fall 2021',
 'Summer 2021',
 'Spring 2021',
 'Fall 2020',
 'Summer 2020',
 'Spring 2020',
 'Fall 2019',
 'Summer 2019',
 'Spring 2019']
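
<font size = "3">

The difference between the displayed text and the internal value can be sketched offline with Python's built-in `xml.etree.ElementTree`. The markup below is hypothetical: only the value "5241" for "Spring 2024" appears in these notes, and the other two values are placeholders. Here `opt.text` and `opt.get("value")` play the roles of Selenium's `.text` and `.get_attribute("value")`.

```python
import xml.etree.ElementTree as ET

# Hypothetical drop-down; "placeholder-a" and "placeholder-b" are NOT real values
sample_select = """
<select id="crit-srcdb">
  <option value="placeholder-a">Fall 2024</option>
  <option value="placeholder-b">Summer 2024</option>
  <option value="5241">Spring 2024</option>
</select>
"""
options = ET.fromstring(sample_select.strip()).findall("option")

# A list comprehension is a compact alternative to the loop above
displayed = [opt.text for opt in options]          # what the user sees
internal  = [opt.get("value") for opt in options]  # what the form submits

print(displayed)
print(internal)
```

With Selenium, the analogous one-liner would be `[opt.text for opt in select_semester.options]`.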

<font size = "5">

Select an option

<font size = "3">

- The code shows three equivalent ways of selecting the same option: <br>
by visible text, by index, or by internal value <br>
- Once you run this line you should see the Google Chrome browser <br>
interactively select that option, as though you had clicked on it

In [5]:
# Select by visible text
select_semester.select_by_visible_text('Spring 2024')

# select_semester.select_by_index(2)
# select_semester.select_by_value('5241')


<font size = "5">

Type text into a search form

<font size = "3">

- Double check on the browser to see that the information <br>
was entered.

In [6]:
# Find element to enter text
form_search = driver.find_element('xpath','//input[@id = "crit-keyword"]') 

# Clear the text before typing anything
form_search.clear()

# Type something on the screen in the location of that particular element
form_search.send_keys("qtm")

<font size = "5">

Mimic pressing "Return" on the keyboard

In [7]:
# "Keys" is a Selenium class whose attributes represent special keyboard keys
form_search.send_keys(Keys.RETURN)

<font size = "5">

Click on an element

In [17]:
# Find element
form_button = driver.find_element('xpath','//*[@id="search-button"]') 

# Click on element
# Note: Here the element that we choose happens to be a button, but 
# in general you can click on any element, the same way you would do 
# as a user.
form_button.click()

<font size = "5">

Try it yourself!

<font size = "3">

- Find the ``` xpath ``` for the drop-down list "Any Career"
- Select the option "Emory College"
- Send the form


In [19]:
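# Write your own code

# Find the "Any Career" drop-down by its id and convert it to a Select object
form_career   = driver.find_element('xpath','//select[@id="crit-career"]')
select_career = Select(form_career)

# Select the option by its visible text
select_career.select_by_visible_text('Emory College')

# Send the form by clicking the search button
form_button = driver.find_element('xpath','//*[@id="search-button"]')
form_button.click()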


# <span style="color:darkblue"> III. Interact and Extract Data </span>

<font size = "5">

Some content is only available after interaction

<font size = "3">

- For example, in the Course Atlas, some information becomes visible <br>
after you search.
- You need to make sure that the Python code mimics the process <br>
you would use if you were to navigate the website.


<font size = "5">

Enter text in a form

In [10]:
form_search = driver.find_element('xpath','//input[@id = "crit-keyword"]') 
form_search.clear()
form_search.send_keys("computing")
form_search.send_keys(Keys.RETURN)
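
<font size = "3">

Search results can take a moment to appear after the form is sent, so code that runs immediately may find nothing. Below is a minimal sketch of a polling helper; `wait_until` is a hypothetical name introduced here, not part of Selenium (Selenium ships its own `WebDriverWait` class for this purpose). The helper relies on the fact that `find_elements` returns an empty list, which counts as false, when nothing matches yet.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns something truthy.

    Gives up with a TimeoutError after roughly `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Possible usage with the driver from this lecture (needs an open browser):
# list_results = wait_until(
#     lambda: driver.find_elements('xpath', '//div[@class = "result result--group-start"]')
# )
```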

<font size = "5">

Extract list of resulting options

<font size = "3">

- Find "xpath" by browsing the website in Developer mode

In [11]:
list_results = driver.find_elements('xpath','//div[@class = "result result--group-start"]') 

# Check number of results
num_results = len(list_results)
print(num_results)


43


<font size = "5">

Extract information of a particular search result

<font size = "3">

- Conduct search for tags + attributes on subelements
- Start the "xpath" with ```.//``` to search only inside the chosen element; <br>
an "xpath" that starts with ```//``` searches the whole page
- Use ```.text``` to extract content
- In this example we are searching for the ```class``` values "result__code" <br>
and "result__title". These names are specific to this Emory website example. <br>
In this example the elements we extract are HTML tags of type "span".


In [12]:
emory_coursecode = list_results[0].find_element('xpath','.//span[@class = "result__code"]').text
emory_classname  = list_results[0].find_element('xpath','.//span[@class = "result__title"]').text

print(emory_coursecode)
print(emory_classname)


BIOL 212
Computational Modeling for Scientists and Engineers
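
<font size = "3">

When searching inside one result, it matters whether the search is relative to that result or covers the whole page. In Selenium, an "xpath" that starts with `//` searches the entire page even when called on a sub-element, while `.//` stays inside the element. The sketch below imitates both behaviors with Python's built-in `xml.etree.ElementTree` on a made-up page; the second course ("DEMO 101") is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical results page with two results (second one is a placeholder)
sample_page = """
<div id="results">
  <div class="result result--group-start">
    <span class="result__code">BIOL 212</span>
    <span class="result__title">Computational Modeling for Scientists and Engineers</span>
  </div>
  <div class="result result--group-start">
    <span class="result__code">DEMO 101</span>
    <span class="result__title">Placeholder Course Title</span>
  </div>
</div>
"""
root = ET.fromstring(sample_page.strip())
results = root.findall(".//div[@class='result result--group-start']")

# Relative search: start from each result, so each row gets its own code
relative_codes = [r.find(".//span[@class='result__code']").text for r in results]

# Page-wide search repeated once per row: always lands on the first match
pagewide_codes = [root.find(".//span[@class='result__code']").text for _ in results]

print(relative_codes)   # ['BIOL 212', 'DEMO 101']
print(pagewide_codes)   # ['BIOL 212', 'BIOL 212']
```

If every row of an extracted dataset looks identical, a page-wide "xpath" inside the loop is a likely cause.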


<font size = "5">

Store extracted data

<font size = "3">

- Loop over all search results and extract the course code and title of each one
- Store each result as a dictionary and append it to a list
- Convert the list of dictionaries to a Pandas DataFrame

In [13]:
data = []
for i in range(num_results):
    # Extract data for a specific search result
    # We use parentheses to split the code into multiple lines to keep
    # things organized.
    # The leading "." in the xpath keeps the search inside list_results[i];
    # without it, every iteration returns the first match on the whole page.
    emory_coursecode = (list_results[i]
                        .find_element('xpath','.//span[@class = "result__code"]')
                        .text)
    emory_coursename = (list_results[i]
                        .find_element('xpath','.//span[@class = "result__title"]')
                        .text)
    
    # Append data as dictionary
    data.append({"coursecode": emory_coursecode,
                 "coursename": emory_coursename})

# Convert to Pandas DataFrame
data_df = pd.DataFrame(data)

data_df


<font size = "5">

Try it yourself!

<font size = "3">

- Expand the dataset above by including the meeting time
- To do so, browse the Emory Course Atlas website with Developer Tools open <br>
and find the tag that is used to denote the course meeting time.
- Try to extract this first for a single element ``` list_results[0] ``` <br>
and once this is working incorporate it into a loop.


Note: In general it's good practice to make sure the code is running <br>
on a single element before automating the process in a loop.
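
The "single element first, then loop" advice can be sketched as a small helper. `extract_row` and `fields` are hypothetical names introduced here for illustration. The helper uses `find_elements` (plural), which returns an empty list instead of raising an error when a result has no matching tag, so a missing field becomes `None`.

```python
def extract_row(result, fields):
    """Build one dictionary from a Selenium element.

    `fields` maps a column name to an xpath relative to `result`.
    """
    row = {}
    for column, xpath in fields.items():
        matches = result.find_elements('xpath', xpath)
        # Empty list means the tag is absent for this result
        row[column] = matches[0].text if matches else None
    return row

# Relative xpaths (note the leading "."); class names are from this lecture
fields = {
    "coursecode": './/span[@class = "result__code"]',
    "coursename": './/span[@class = "result__title"]',
}

# Step 1: check a single element (needs the open browser from above)
# print(extract_row(list_results[0], fields))

# Step 2: once that looks right, loop over all results
# data = [extract_row(result, fields) for result in list_results]
```

Adding the meeting time then only requires one more entry in `fields`, using the tag and class you find in Developer Tools.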

In [14]:
# Write your own code

data = []
for i in range(num_results):
    # Extract data for a specific search result
    # We use parentheses to split the code into multiple lines to keep
    # things organized. The leading "." keeps each search inside list_results[i].
    emory_coursecode = (list_results[i]
                        .find_element('xpath','.//span[@class = "result__code"]')
                        .text)
    emory_coursename = (list_results[i]
                        .find_element('xpath','.//span[@class = "result__title"]')
                        .text)
    # TODO: "result__title" is a placeholder copied from above; replace it
    # with the class of the meeting-time tag that you find in Developer Tools
    emory_coursetime = (list_results[i]
                        .find_element('xpath','.//span[@class = "result__title"]')
                        .text)
    
    # Append data as dictionary
    data.append({"coursecode": emory_coursecode,
                 "coursename": emory_coursename,
                 "coursetime": emory_coursetime})

# Convert to Pandas DataFrame
data_df = pd.DataFrame(data)

data_df



# <span style="color:darkblue"> IV. Additional Links </span>

<font size = "5">

More information on Selenium:

https://www.selenium.dev/documentation/overview/

Details on clicking through forms:

https://www.selenium.dev/documentation/webdriver/support_features/select_lists/
