# Web Scraping using Selenium
- https://brightdata.com/blog/how-tos/scrape-dynamic-websites-python

In [6]:
#!pip install selenium
#!pip install webdriver_manager

In [7]:
# import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

In [8]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [9]:
# Define the URL
url = "https://www.youtube.com/@programmingwithmosh/videos"

# load the web page
driver.get(url)

# set maximum time to load the web page in seconds
driver.implicitly_wait(10)

- Selenium automatically loads the YouTube link in the Chrome browser.
- Additionally, a time frame is specified (ie ten seconds) to make sure that the web page is fully loaded (including all HTML elements). This helps you scrape data that is rendered by JavaScript.
- Scrape Data Using ID and Tags
- One of the benefits of Selenium is that it can extract data using different elements presented on the web page, including the ID and tag.
- For instance, you can use either the ID element (ie post-title) or tags (ie h1 and p) to scrape the data:
- -<h1 id ="post-title">Introduction to data scrapping using Python</h1>
- <p>You can use selenium python package to collect data from any dynamic website</p>
- Use Webdriver to scrape data that is within the ID identified. To find an HTML element by ID attribute, call the find_element() Selenium method and pass By.ID as the first argument and ID as the second argument.
- To collect the video title and video link for each video, you need to use the video-title-link ID attribute. Since you’re going to collect multiple HTML elements with this ID attribute, you’ll need to use the find_elements() method:

In [10]:
# collect data that are withing the id of contents
contents = driver.find_element(By.ID, "contents")

#1 Get all the by video tite link using id video-title-link
video_elements = contents.find_elements(By.ID, "video-title-link")

#2 collect title and link for each youtube video
titles = []
links = []

for video in video_elements:

    #3 Extract the video title
    video_title = video.get_attribute("title")

    #4 append the video title
    titles.append(video_title)

    #5 Extract the video link
    video_link = video.get_attribute("href")

#6 append the video link
links.append(video_link)

- This code performs the following tasks:
  - It collects data that is within the ID attribute of contents.
  -  It collects all HTML elements that have an ID attribute of video-title-link from the WebElement contents object.
  - It creates two lists to append titles and links.
  - It extracts the video title using the get_attribute()method and passes the title.
  - It appends the video title in the titles list.
  - It extracts the video link using the get_atribute() method and passes href as an argument.
  - It appends the video link in the links list.
- At this point, all the video titles and links will be in two Python lists: titles and links.
- - Next, you need to scrape the link of the image that is available on the web page before you click the YouTube video link to watch the video. To scrape this image link, you need to find all the HTML elements by calling the find_elements() Selenium method and passing By.TAG_NAME as the first argument and the name of the tag as the second argument:

In [11]:
#1 Get all the by Tag
img_elements = contents.find_elements(By.TAG_NAME, "img")

#2 collect img link and link for each youtube video
img_links = []

for img in img_elements:

    #3 Extract the img link
    img_link = img.get_attribute("src")
    if img_link:
        #4 append the img link
        img_links.append(img_link)

- This code collects all the HTML elements with the img tag name from the WebElement object called contents. It also creates a list to append the image links and extracts it using the get_attribute() method and passes src as an argument. Finally, it appends the image link to the img_links list.
- You can also use the ID and the tag name to scrape more data for each YouTube video. On the web page of the YouTube URL, you should be able to see the number of views and the time published for each video listed on the page. To extract this data, you need to collect all the HTML elements that have an ID of metadata-line and then collect data from the HTML elements with a span tag name:

In [12]:
#1 find the element with the specific ID you want to scrape
meta_data_elements = contents.find_elements(By.ID, 'metadata-line')

#2 collect data from span tag
meta_data = []

for element in meta_data_elements:
    #3 collect span HTML element
    span_tags = element.find_elements(By.TAG_NAME, 'span')

    #4 collect span data
    span_data = []
    for span in span_tags:
        #5 extract data for each span HMTL element.
        span_data.append(span.text)
    #6 append span data to the list
    meta_data.append(span_data)

# print out the scraped data.
print(meta_data)

[['109K views', '1 day ago'], ['168K views', '7 days ago'], ['34K views', '3 weeks ago'], ['39K views', '1 month ago'], ['83K views', '3 months ago'], ['64K views', '4 months ago'], ['41K views', '4 months ago'], ['135K views', '5 months ago'], ['419K views', '6 months ago'], ['32K views', '6 months ago'], ['2.2M views', '1 year ago'], ['3.5M views', '1 year ago'], ['48K views', '1 year ago'], ['82K views', '1 year ago'], ['3.2M views', '1 year ago'], ['1M views', '1 year ago'], ['402K views', '1 year ago'], ['159K views', '2 years ago'], ['58K views', '2 years ago'], ['67K views', '2 years ago'], ['130K views', '2 years ago'], ['56K views', '2 years ago'], ['81K views', '2 years ago'], ['104K views', '2 years ago'], ['2.2M views', '2 years ago'], ['2.3M views', '2 years ago'], ['45K views', '2 years ago'], ['8.1M views', '3 years ago'], ['1.4M views', '3 years ago'], ['221K views', '3 years ago']]


- This code block collects all the HTML elements that have an ID attribute of metadata-line from the WebElement contents object and creates a list to append data from the span tag that will have the number of views and the time published.
- It also collects all the HTML elements whose tag name is span from the WebElement object called meta_data_elements and creates a list with this span data. Then it extracts the text data from the span HTML element and appends it to the span_data list. Finally, it appends the data from the span_data list to the meta_data.
- The data extracted from the span HTML element will look like this:
- - Next, you need to create two Python lists and save the number of views and time published separately:

In [17]:
#1 Iterate over the list of lists and collect the first and second item of each sublist
views_list = []
published_list = []

for sublist in meta_data:
    #2 append number of views in the views_list
    views_list.append(sublist[0])

    #3 append time published in the published_list
    published_list.append(sublist[1])

- Here, you create two Python lists that extract data from meta_data, and you append the number of views for each sublist to view_list and the time published for each sublist to the published_list.
- At this point, you’ve scraped the title of the video, the URL of the video page, the URL of the image, the number of views, and the time the video was published. This data can be saved into a pandas DataFrame using the pandas Python package. Use the following code to save the data from the list of titles, links, img_links, views_list, and published_list into the pandas DataFrame:

In [18]:
# save in pandas dataFrame
data = pd.DataFrame(
list(zip(titles, links, img_links, views_list, published_list)),
columns=['Title', 'Link', 'Img_Link', 'Views', 'Published'])

# show the top 10 rows
#data.head(10)

# export data into a csv file.
#data.to_csv("../data/youtube_data.csv",index=False)

driver.quit()

In [19]:
data

Unnamed: 0,Title,Link,Img_Link,Views,Published
0,Is Devin AI the end (or future) of coding?!,https://www.youtube.com/watch?v=Nb0btdq1164,https://i.ytimg.com/vi/XKkoVpupYdw/hqdefault_c...,109K views,1 day ago
