<img src="https://i.imgur.com/YxnDUwA.png" width=400>

Often we don't have enough data, or any data at all, to perform an accurate analysis. This is a common issue for new or niche projects. In those cases, we may need to collect the data ourselves.

__A Web Scraper__ is a program that extracts data from a website.
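To make the idea concrete before bringing in a browser, here is a minimal sketch that pulls links and titles out of a small HTML string using only the standard library. The markup in `SAMPLE_HTML` and the names `LinkCollector` and `extract_links` are made up for illustration; a real page is larger and usually needs a tool such as Selenium, as used in the rest of this notebook.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modelled on a news listing.
SAMPLE_HTML = '''
<article class="item-news">
  <a href="https://example.com/story-1" title="First story">First story</a>
</article>
<article class="item-news">
  <a href="https://example.com/story-2" title="Second story">Second story</a>
</article>
'''


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attributes = dict(attrs)
            if 'href' in attributes:
                self.links.append((attributes['href'], attributes.get('title', '')))


def extract_links(html):
    """Parse an HTML string and return the collected (href, title) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

The same pattern (locate an element, then read its attributes) is what the Selenium code in this notebook does against the live page.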

### Problem Statement
- Build a Web Scraper to collect data about articles on [https://vnexpress.net/](https://vnexpress.net/).
- Required information:
  - Title
  - Description
  - Link to the Article
  - Link to Thumbnail Image (optional)
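
One way to keep the required fields in view while writing the scraper is to describe a single record up front. The sketch below is only an illustration: the `Article` class and its field names are assumptions, not something the rest of this notebook relies on.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Article:
    """One scraped article (hypothetical record type for illustration)."""
    title: str
    description: str
    link: str
    thumbnail_link: Optional[str] = None  # optional, so it may be missing


example = Article(
    title='Example title',
    description='Example description.',
    link='https://vnexpress.net/',
)
print(asdict(example))
```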

In [None]:
# install selenium and other resources for crawling data
!pip install selenium
!apt-get update
# install other resources for doing crawling
!apt install chromium-chromedriver

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting selenium
  Downloading selenium-4.3.0-py3-none-any.whl (981 kB)
Collecting urllib3[secure,socks]~=1.26
  Downloading urllib3-1.26.10-py2.py3-none-any.whl (139 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting pyOpenSSL>=0.14
  Downloading p

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

In [None]:
DRIVER = None

In [None]:
def initialize_driver():
    global DRIVER
    if DRIVER is None:
        print('Initiating driver...')
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')    # run Chrome without a visible window
        chrome_options.add_argument('--no-sandbox')  # commonly needed in containers such as Colab
        DRIVER = webdriver.Chrome(options=chrome_options)  # create the Chrome browser; chromedriver must be on PATH
        print('Finished!')

In [None]:
def close_driver():
    global DRIVER
    if DRIVER is not None:
        print('Quitting drive...')
        DRIVER.quit()
        print('Done')

    DRIVER = None

In [None]:
close_driver()

In [None]:
initialize_driver()

Initiating driver...
Finished!


In [None]:
DRIVER.get('https://vnexpress.net/')
DRIVER.current_url

'https://vnexpress.net/'

# 🔬 ANALYZE

💡 **item-news**

![](https://i.imgur.com/sI6Slxi.png)


In [None]:
all_news_elements = DRIVER.find_elements(By.CLASS_NAME, 'item-news')
len(all_news_elements)

184

To keep things simple, we will **test on a single article** first. If that is successful, we will run the code on all the articles we found.

In [None]:
news_element = all_news_elements[9]

In [None]:
print(news_element.get_attribute('outerHTML'))

<article class="item-news item-news-common off-thumb">
<a href="https://vnexpress.net/topic/cang-thang-nga-ukraine-25857" data-itm-source="#vn_source=Home&amp;vn_campaign=Banner-NgaTanCongUkraina&amp;vn_medium=Item-0&amp;vn_term=Desktop" title="Căng thẳng biên giới Ukraine - Nga" data-itm-added="1">
<img alt="Căng thẳng biên giới Ukraine - Nga" src="https://s1cdn.vnecdn.net/vnexpress/restruct/i/v626/v2_2019/pc/graphics/Nga-Ukraine-Pc.jpg">
</a>
</article>


We can see that all the information we need is inside the `item-news` element:
- Title
- Link to the article
- Description

Let's dive in

## Description

Within a description element, you can get the **article link**, **title** and **description**. Let's get them

In [None]:
description_element = news_element.find_elements(By.CLASS_NAME, 'description')

StaleElementReferenceException: ignored
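
A `StaleElementReferenceException` generally means the page changed after the element was located, so the stored reference no longer points into the live page. A common remedy is to locate the element again right before using it. The helper below is a minimal sketch of that idea, not part of Selenium: `call_with_relocate` and its parameters are hypothetical names, and the exception type is passed in so the sketch runs without a browser.

```python
def call_with_relocate(locate, action, retry_on, attempts=3):
    """
    Locate an element, run `action` on it, and locate it again
    if the reference turned out to be stale.

    locate   -- callable returning a fresh element reference
    action   -- callable taking the element and returning a value
    retry_on -- exception type (or tuple of types) that triggers a retry
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')

    last_error = None
    for _ in range(attempts):
        element = locate()  # always fetch a fresh reference
        try:
            return action(element)
        except retry_on as error:
            last_error = error  # remember the failure and try again
    raise last_error


# With Selenium, the call might look like this (assumes DRIVER and By from above):
# from selenium.common.exceptions import StaleElementReferenceException
# html = call_with_relocate(
#     locate=lambda: DRIVER.find_elements(By.CLASS_NAME, 'item-news')[9],
#     action=lambda element: element.get_attribute('outerHTML'),
#     retry_on=StaleElementReferenceException,
# )
```

In this notebook, simply re-running the `find_elements` cell before using `news_element` has the same effect.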

Get article link:

In [None]:
a_element = news_element.find_element(By.TAG_NAME, 'a')
print(a_element)

<selenium.webdriver.remote.webelement.WebElement (session="6a317626ecb072f81cf2885e26a260bc", element="0a4d2548-94f3-4628-b463-9e8fda0669e8")>


In [None]:
article_link = a_element.get_attribute('href')
print(article_link)

https://vnexpress.net/topic/cang-thang-nga-ukraine-25857


Get title:

In [None]:
title = a_element.get_attribute('title')
print(title)

Căng thẳng biên giới Ukraine - Nga


Get description:

In [None]:
a_element.text

''

Alright, we were able to get 3 out of the 4 things we want. At this point, you should **write a function** for reusability.

In [None]:
def get_link_title_description(news_element):
    '''
    Return link, title and description of an article from a web element
    '''

    description_element = news_element.find_element(By.CLASS_NAME, 'description')
    a_element           = description_element.find_element(By.TAG_NAME, 'a')

    # article link
    article_link = a_element.get_attribute('href')

    # title
    title = a_element.get_attribute('title')

    # description
    desc = a_element.text

    return article_link, title, desc

In [None]:
news_element = all_news_elements[9]

a, b, c = get_link_title_description(news_element)

StaleElementReferenceException: ignored

In [None]:
a, b, c

('https://vnexpress.net/gu-mac-thanh-lich-cua-truong-doan-bong-da-thai-lan-4463483.html',
 'Gu mặc thanh lịch của trưởng đoàn bóng đá Thái Lan',
 'Nualphan Lamsam, trưởng đoàn đội tuyển bóng đá Thái Lan, được các tạp chí thời trang nhận xét phong cách thanh lịch.')

## Thumb-art

Getting the thumbnail link is a little trickier, but it becomes easier with the browser's Inspect tool.

In [None]:
news_element = all_news_elements[8]

In [None]:
thumbnail_link = news_element.find_element(By.TAG_NAME, 'img').get_attribute('src')
print(thumbnail_link)

https://vcdn1-kinhdoanh.vnecdn.net/2022/05/15/xangQuynhTran22022-1652547521-3876-1652547817.jpg?w=220&h=132&q=100&dpr=1&fit=crop&s=F2FIEYC0PKk1LIWQdi7_3w


In [None]:
# thumbart_element = news_element.find_element(By.CLASS_NAME, 'thumb-art')

Of course, we will package everything we just wrote into a function as well.

In [None]:
def get_thumbnail_link(news_element):
    '''
    Return thumbnail link (if possible) given the web element
    '''
    
    thumbnail_link = ''
    try:
        thumbnail_link = news_element.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except Exception:  # if there's an error
        print('Cannot find thumbnail_link')
    
    return thumbnail_link

In [None]:
print(get_thumbnail_link(news_element))

https://vcdn1-kinhdoanh.vnecdn.net/2022/05/15/xangQuynhTran22022-1652547521-3876-1652547817.jpg?w=220&h=132&q=100&dpr=1&fit=crop&s=F2FIEYC0PKk1LIWQdi7_3w
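
Part of what can make thumbnails tricky is lazy loading: some sites put a placeholder in `src` and keep the real URL in another attribute such as `data-src`. Whether VNExpress does this is an assumption that is not verified here. The sketch below shows one way to choose between the two values; `choose_image_url` is a hypothetical helper, and with Selenium its inputs would come from `get_attribute('src')` and `get_attribute('data-src')`.

```python
def choose_image_url(src, data_src=None):
    """
    Return the first value that looks like a real URL.

    Inline placeholders start with 'data:' and are skipped.
    Returns '' when neither value is usable.
    """
    for url in (data_src, src):
        if url and not url.startswith('data:'):
            return url
    return ''


print(choose_image_url('data:image/gif;base64,AAAA', 'https://example.com/a.jpg'))
print(choose_image_url('https://example.com/b.jpg'))
```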


# 🏃🏻‍♂️ LAB

## Requirements

You are now required to write a complete program to scrape **all** the articles on VNExpress.

- A valid article must have **at least 3** pieces of information:
  1. Title
  2. Description
  3. Link to the article
  4. Link to thumbnail image (if possible)

- Your function should also count the number of valid articles it found.

- Please use the functions we define above. We created them for a reason.
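
The validity rule and the count can be checked independently of the browser. The sketch below is one possible reading of the requirement (all three mandatory fields must be non-empty); `is_valid_article` and the sample tuples are made up for illustration.

```python
def is_valid_article(link, title, description):
    """True when link, title and description are all non-empty strings."""
    return all(
        isinstance(value, str) and value.strip()
        for value in (link, title, description)
    )


# Hypothetical sample data in (link, title, description) order.
samples = [
    ('https://example.com/a', 'Title A', 'Description A'),
    ('https://example.com/b', 'Title B', ''),   # missing description
    ('', 'Title C', 'Description C'),           # missing link
]

valid_articles = [sample for sample in samples if is_valid_article(*sample)]
print(len(valid_articles))
```

The thumbnail link is left out of the check on purpose, since the requirement treats it as optional.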

In [None]:
close_driver()
initialize_driver()

Quitting drive...
Done
Initiating driver...
Finished!


In [None]:
DRIVER.get('https://vnexpress.net/')

In [None]:
def scrape_vnexpress(driver):
    '''
    Return [link, title, description, thumbnail_link] for every valid article on the page
    '''
    results = []
    all_news_elements = driver.find_elements(By.CLASS_NAME, 'item-news')

    for news_element in all_news_elements:
        try:
            link, title, description = get_link_title_description(news_element)
        except Exception:  # e.g. the element has no description block
            continue

        # a valid article needs a link, a title and a description
        if not (link and title and description):
            continue

        thumbnail_link = get_thumbnail_link(news_element)  # '' when there is no image
        results.append([link, title, description, thumbnail_link])

    print(f'Found {len(results)} valid articles')
    return results

In [None]:
results = scrape_vnexpress(DRIVER)

In [None]:
results[0:5]

## Save results as csv file

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(results, columns=['link','title','description','thumbnail_link'])

In [None]:
df.head()

In [None]:
df.to_csv('vnexpress_scraped.csv', index=False)
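
For comparison, the same kind of file can be written and read back with the standard library alone, which is a quick way to confirm that Vietnamese text and commas inside descriptions survive the round trip. This is a sketch under assumptions: `save_rows`, `load_rows`, the sample row and the temporary path are hypothetical, and the column order simply mirrors the DataFrame above.

```python
import csv
import os
import tempfile

COLUMNS = ['link', 'title', 'description', 'thumbnail_link']


def save_rows(rows, path):
    """Write a header plus one line per article, encoded as UTF-8."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)


def load_rows(path):
    """Read the file back as a list of dicts keyed by column name."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


# Hypothetical sample row with diacritics and an embedded comma.
rows = [['https://example.com/a', 'Tiêu đề', 'Mô tả, có dấu phẩy', '']]
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

save_rows(rows, path)
loaded = load_rows(path)
print(loaded[0]['title'])
```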