# Web Scraping workshop

## 0. Some basic questions

### 1. What is web scraping and why is it important?
>#### 1. Web scraping means collecting raw data from a site and parsing it down to the information that is useful to the user.
>#### 2. It is needed for:
>>#### 1. creating datasets for ML/AI projects
>>#### 2. automating tasks that are lengthy or time-consuming when done by hand
>>#### 3. any other purpose you have in mind; learning it may seem pointless right now, but it pays off in many ways

### 2. What is Selenium WebDriver, and why are we using it for web scraping?
>#### 1. Selenium is a tool for simulating (automating) a browser. It is used for various purposes, such as testing your web apps and scraping.
>#### 2. Why are we using Selenium? This question only arises once you know that there are libraries that can parse HTML, and that the raw HTML file can be fetched with Python's built-in libraries.
This is an important question, and it needs a heading of its own to explain.

### 3. DOM elements of a site, API responses and other dynamic elements:
>#### 1. Many elements are dynamic in nature, i.e. they are not part of the HTML file itself but are added afterwards. "Afterwards" may be only a moment later, but they still do not arrive with the HTML file.
>#### 2. To see this happen, "Inspect" your web page, go to the "Network" tab and then reload.
>#### 3. The problem is that a library that simply returns the HTML of a site gives us the content of the raw HTML file. It may contain the scripts used by the front end, but not the data those scripts load later.
>#### 4. So the practical option that remains is to simulate the browser, let the DOM elements load and then extract the data. This is precisely what we are going to do.

### 4. To see the difference we are talking about, open a site; there are two ways to view its HTML:
>#### 1. Use 'Inspect' by pressing F12 or Ctrl+Shift+I; it shows the HTML with the loaded DOM elements
>#### 2. Use 'Ctrl+U' to view the page source; this is the raw file that we receive without any driver
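
The gap between the raw file and the loaded DOM can also be checked in code. The sketch below uses only Python's built-in `html.parser` on a made-up page source (the `RAW_HTML` string and the `load-jobs.js` name are assumptions for illustration, not taken from any real site). It shows that the raw file can contain the list container and the script while the list items themselves are missing.

```python
from html.parser import HTMLParser

# Hypothetical page source: the list is empty and a script fills it in later.
RAW_HTML = """
<html>
  <body>
    <ul class="jobs-list"></ul>
    <script src="load-jobs.js"></script>
  </body>
</html>
"""


class TagCounter(HTMLParser):
    """Counts how often each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(RAW_HTML)

# The container and the script are present, but there are no <li> items:
# those would only exist after a browser has run the script.
print(counter.counts.get("ul", 0))      # 1
print(counter.counts.get("script", 0))  # 1
print(counter.counts.get("li", 0))      # 0
```

A parser that only sees this file has nothing to extract, which is why we let a real browser build the page first.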

## 1. Using Selenium WebDriver

### 1. Basic tasks:
>#### 1. Open the Selenium WebDriver documentation: https://www.selenium.dev/documentation/en/
>#### 2. Install Selenium:
>>#### 1. Run "pip install selenium" in your CMD/PowerShell/kernel (the imports below also use webdriver_manager, installed with "pip install webdriver-manager")
>#### 3. Install ChromeDriver (hold off on this for now if you have not done it already)
>#### 4. This notebook is available at https://github.com/am-a-man/project-web_scraper/blob/main/webscraper_presentation.ipynb; if you have Jupyter and are comfortable with it, download it for your reference

Import all the libraries specified below. As you make more projects you will get familiar with all of them; if you have any doubts when working with these libraries on your own, refer to their documentation.

In [1]:
import os
from selenium import webdriver
import time

from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.support.ui import WebDriverWait


### What are we going to learn in this workshop:
> 1. Finding and specifying the element we need
> 2. Extracting the content from the element
> 3. Extracting any attribute from the element (for example, the href attribute of an <a></a> tag)
> 4. Clicking on an element

Setup code:

In [4]:
os.environ['WDM_LOG_LEVEL'] = '0'
# path = 'c:/users/aman/download/chromedriver.exe'

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor='https://requip.herokuapp.com',
    options=chrome_options
)
# driver = webdriver.Firefox(path)

search_url='https://careers.microsoft.com/students/us/en/search-results'
driver.get(search_url)
print(ChromeDriverManager())



WebDriverException: Message: <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /session</pre>
</body>
</html>



#### Explanation of the above lines:
> 1. os.environ['WDM_LOG_LEVEL'] = '0' : silences the log messages that webdriver_manager prints while it downloads and sets up the driver
2. driver = webdriver.Chrome(ChromeDriverManager().install()) : the local alternative to the Remote driver used in the setup code; this statement downloads and installs ChromeDriver. If you already have it installed, just pass its path as the argument: 'webdriver.Chrome(PATH)'
3. search_url='https://careers.microsoft.com/students/us/en/search-results' : this is the URL that we are going to scrape
4. driver.get(search_url) : tells the browser to open the URL and load the page, just like typing it into the address bar; the page content is then available through the driver
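
Because the jobs list is added after the initial HTML arrives, a lookup made right after `driver.get` can run too early. A minimal sketch of the usual remedy, polling until the content shows up, is given below. `wait_until` and `jobs_loaded` are hypothetical names written for this illustration, and the "page" is a stand-in function so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_every=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poll_every)


# Stand-in for a page whose list is only filled in on the third look.
attempts = []


def jobs_loaded():
    attempts.append(1)
    return ["job 1", "job 2"] if len(attempts) >= 3 else []


print(wait_until(jobs_loaded, timeout=2.0, poll_every=0.01))  # ['job 1', 'job 2']
```

With a real driver, the condition would be a function that looks up the job items and returns the (possibly empty) list. Selenium's `WebDriverWait`, imported above, offers the same idea through its `until(...)` method.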

### 1. finding the element we need to specify:
> 1. run the setup code and inspect the site to look for the element:

In [None]:
"""the data we want to extract is the jobs list, when we inspect we can see that there is a 
"<ul></ul>" tag with class = 'jobs-list'
and each of the list items in "<li></li>" tag has class 'jobs-list-item'"""

1. Now, to use this element in our program:
2. there are many ways that we can do this:
    1. see the documentation for this: https://selenium-python.readthedocs.io/locating-elements.html    
    2. know the difference between '..elements..' and '..element..'
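
The element/elements distinction can be tried out without a browser. The sketch below is only an analogy: it uses Python's built-in `xml.etree.ElementTree`, not Selenium, on a made-up and simplified version of the jobs markup. `find` plays the role of the '..element..' methods (one result) and `findall` the role of the '..elements..' methods (a list of results).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified version of the jobs markup.
page = ET.fromstring("""
<body>
  <ul data-ph-at-id="jobs-list">
    <li data-ph-at-id="jobs-list-item">Job A</li>
    <li data-ph-at-id="jobs-list-item">Job B</li>
  </ul>
</body>
""")

# One match: a single element, which has a .text of its own.
first = page.find(".//ul[@data-ph-at-id='jobs-list']/li")
# All matches: a plain Python list of elements.
every = page.findall(".//ul[@data-ph-at-id='jobs-list']/li")

print(type(first).__name__, first.text)  # Element Job A
print(type(every).__name__, len(every))  # list 2
print([li.text for li in every])         # ['Job A', 'Job B']
```

A list has no `.text`; you index into it or loop over it to reach the individual elements.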

The next few blocks have the same output:

In [3]:
path1 = driver.find_element_by_xpath("//ul[@data-ph-at-id='jobs-list']")
print(path1)
print(type(path1))
# this returns a Selenium WebElement
# to see its content, use "element_name.text"
print(path1.text)

<selenium.webdriver.remote.webelement.WebElement (session="cc99fbd372b7f306dc6bd197c4ab253e", element="ffbde9ae-48e8-4770-9c2c-658f8e2a5751")>
<class 'selenium.webdriver.remote.webelement.WebElement'>
Internship Opportunities: Spreadsheet Experience and Technology
Cambridge, Cambridgeshire, United Kingdom
Research
Dec 8, 2020
The soul of the spreadsheet is the grid and its formulas.  Spreadsheets are the world’s most widely-used programming technology – but they also embody apparently-fundamental limitations.
Save
Internship Opportunities: Future of Work, User Experience/HCI Researcher
Cambridge, Cambridgeshire, United Kingdom
Research
Feb 1, 2021
We have an exciting opportunity for a User Experience/Human-Computer Interaction (HCI) Research intern to work with us at Microsoft Research Cambridge as part of the Future of Work team. This role is
Save
2022 Graduates Summer Intern - Program Manager Intern - C+AI - Beijing
Beijing, Beijing, China
Engineering
Feb 8, 2021
Microsoft mission is

In [4]:
path2 = driver.find_elements_by_xpath("//ul[@data-ph-at-id='jobs-list']")
print(type(path2))
print(path2.text)

<class 'list'>


AttributeError: 'list' object has no attribute 'text'

Explain the reason for the above error:

In [5]:
path3 = path2[0].find_element_by_tag_name('li')
print(path3)
print(path3.text)

<selenium.webdriver.remote.webelement.WebElement (session="cc99fbd372b7f306dc6bd197c4ab253e", element="dd7e924c-6643-4d08-82c3-19dffa247eac")>
Internship Opportunities: Spreadsheet Experience and Technology
Cambridge, Cambridgeshire, United Kingdom
Research
Dec 8, 2020
The soul of the spreadsheet is the grid and its formulas.  Spreadsheets are the world’s most widely-used programming technology – but they also embody apparently-fundamental limitations.
Save


In [6]:
path4 = path2[0].find_elements_by_tag_name('li')
# remember path4 is a list
for i in path4:
    print("================================\n"+i.text+"\n===============================\n")

Internship Opportunities: Spreadsheet Experience and Technology
Cambridge, Cambridgeshire, United Kingdom
Research
Dec 8, 2020
The soul of the spreadsheet is the grid and its formulas.  Spreadsheets are the world’s most widely-used programming technology – but they also embody apparently-fundamental limitations.
Save

Internship Opportunities: Future of Work, User Experience/HCI Researcher
Cambridge, Cambridgeshire, United Kingdom
Research
Feb 1, 2021
We have an exciting opportunity for a User Experience/Human-Computer Interaction (HCI) Research intern to work with us at Microsoft Research Cambridge as part of the Future of Work team. This role is
Save

2022 Graduates Summer Intern - Program Manager Intern - C+AI - Beijing
Beijing, Beijing, China
Engineering
Feb 8, 2021
Microsoft mission is to, “Empower every person and every organization on the planet to achieve more,” Building on this mission, Cloud+ AI seeks to enable organizations in two core ways
Save

2022 Graduates Summer Intern

In [7]:
path4 = driver.find_element(By.XPATH, "//ul[@data-ph-at-id='jobs-list']//li[@data-ph-at-id='jobs-list-item']")
# this would give an error, what is that error?

NameError: name 'By' is not defined

In [8]:
from selenium.webdriver.common.by import By
path4 = driver.find_element(By.XPATH, "//ul[@data-ph-at-id='jobs-list']//li[@data-ph-at-id='jobs-list-item']")
print(path4.text)

Internship Opportunities: Spreadsheet Experience and Technology
Cambridge, Cambridgeshire, United Kingdom
Research
Dec 8, 2020
The soul of the spreadsheet is the grid and its formulas.  Spreadsheets are the world’s most widely-used programming technology – but they also embody apparently-fundamental limitations.
Save


#### perform the above using ".....elements..."

### 2. extracting the content from the element:


we have already used element_name.text to extract the content

### 3. extracting any attribute from the element (for example: you will be extracting href attribute from <a></a> tag)

In [9]:
# linkElement = driver.find_element_by_xpath("//ul[@data-ph-at-id='jobs-list']//li[@data-ph-at-id='jobs-list-item']/a")
# above line will not work, why?
linkPath = driver.find_element_by_xpath("//ul[@data-ph-at-id='jobs-list']//li[@data-ph-at-id='jobs-list-item']")
linkElement = linkPath.find_element_by_tag_name('a')
link = linkElement.get_attribute('href')
print(link)

https://careers.microsoft.com/students/us/en/job/946737/Internship-Opportunities-Spreadsheet-Experience-and-Technology
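
One plausible reason the commented-out XPath fails is that `/a` only matches a direct child of the `<li>`, while searching by tag name inside the element looks at every depth. We have not verified the real page structure here, so the sketch below uses a made-up job card and Python's built-in `xml.etree.ElementTree` (an analogy, not Selenium) to show the difference, along with reading the `href` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical job card: the link is wrapped in a <div>,
# so it is not a direct child of the <li>.
item = ET.fromstring("""
<li data-ph-at-id="jobs-list-item">
  <div class="information">
    <a href="https://example.com/job/1">Job A</a>
  </div>
</li>
""")

direct_child = item.find("a")   # like ".../li/a": looks one level down only
descendant = item.find(".//a")  # like ".../li//a": looks at every depth

print(direct_child)             # None
print(descendant.get("href"))   # https://example.com/job/1
```

Note that Selenium raises `NoSuchElementException` when a single-element lookup finds nothing, instead of returning `None` as ElementTree does.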


### 4. clicking an element:


1. Selenium supports many user actions, for example: tap, double tap, flick, scroll, click, tap and hold, etc.
2. The most useful of them is click(), so useful that it is available directly as a method on the element

In [10]:
linkElement.click()

Now we have all the parts required to get the list of all jobs from the website. Since this is a workshop, we want you to code right now; if you run into a problem that you cannot solve on your own, please ask.
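
If you get stuck, here is one possible way to combine the parts, offered as a sketch and not as the only solution. `collect_jobs` is a hypothetical helper name. It passes the locator strategy as a plain string (`"xpath"` and `"tag name"` are the values behind `By.XPATH` and `By.TAG_NAME`), and it assumes that the first line of each item's text is the job title, as in the outputs above.

```python
JOBS_XPATH = (
    "//ul[@data-ph-at-id='jobs-list']"
    "//li[@data-ph-at-id='jobs-list-item']"
)


def collect_jobs(driver):
    """Return one dict per job item currently loaded in the page."""
    jobs = []
    # find_elements returns a list, so an empty page simply yields no jobs.
    for item in driver.find_elements("xpath", JOBS_XPATH):
        # Search for links inside this item only.
        links = item.find_elements("tag name", "a")
        lines = item.text.splitlines()
        jobs.append({
            # Assumption: the title is the first line of the item's text.
            "title": lines[0] if lines else "",
            "link": links[0].get_attribute("href") if links else None,
            "text": item.text,
        })
    return jobs
```

With the `driver` from the setup code, calling `collect_jobs(driver)` after the page has loaded gives a list you can loop over to print each job's title and link.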