# **Data Collection Part 1**: Web Scraping <u>Real News</u>

This notebook demonstrates the first part of this project's data collection: web scraping real news from well-known news sites in the Philippines that are considered reliable.

The **news sites involved** in collecting real news are the following: 
1. [`GMA News`](https://www.gmanetwork.com/news/topstories/)
2. [`Rappler`](https://www.rappler.com/)

# Importing Libraries
To start, we import the libraries needed to perform web scraping and data processing.

## Basic Libraries
*  [`time`](https://docs.python.org/3/library/time.html): provides functions for handling time-related tasks 
> this allows us to add pauses to the loops that we will run later on

*   [`pandas`](https://pandas.pydata.org/) : has functions for data analysis and manipulation
> this allows us to obtain and organize all scraped data in a format that can be used for Exploratory Data Analysis (EDA) and for creating a model for fake news detection

In [None]:
import time
import pandas as pd

## Web Scraping Library: [`Selenium`](https://selenium-python.readthedocs.io/index.html)
> The [`Selenium`](https://selenium-python.readthedocs.io/index.html) library will be used for  web scraping GMA and Rappler since navigation through multiple pages will be performed on both sites.

*    `from selenium import webdriver` enables automation of web browser
*    `from selenium.webdriver.common.keys import Keys` enables simulation of keyboard keys
*   `from selenium.webdriver.common.by import By` helps with locating elements in the webpage through XPath
*   `from selenium.webdriver.chrome.options import Options` lets us specify preferences when initializing the WebDriver, which helps prevent a browser window from opening each time the loops run

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Preparing [`Chrome WebDriver`](https://chromedriver.chromium.org/downloads) for Selenium Web Scraping

Since the needed libraries have already been imported, we will now **set up our Chrome WebDriver** before we start web scraping using the Selenium library. 

* [`Chrome WebDriver`](https://chromedriver.chromium.org/getting-started) is the driver that lets the Selenium WebDriver control the Chrome window during web scraping.

## 1. To start setting up the driver, download the driver from the [`ChromeDriver site`](https://chromedriver.chromium.org/downloads). 
It is recommended to put the downloaded file in a location that is easy to find, or in the same folder as the project, so that the next step is easier to perform.

## 2. Set the driver path 
In this step, specify the path to the driver on your own device and store it in a variable; here, it is stored in `driver_path`. This path is used to initialize the Chrome WebDriver in a later step of the setup.

In [None]:
# `driver_path` : contains the path where the WebDriver is found
driver_path="/Users/ajmarcelo/Downloads/chromedriver_mac64/chromedriver"
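
Before moving on, it can help to confirm that the path really points to a file, since a wrong path only shows up later as an error when the driver is initialized. The following is a minimal sketch; `driver_path_exists` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def driver_path_exists(path_str):
    """Return True if `path_str` points to an existing file (hypothetical helper)."""
    # expanduser() lets paths that start with "~" work as well
    return Path(path_str).expanduser().is_file()


# Example usage with a made-up path that should not exist on any machine
print(driver_path_exists("/no/such/folder/chromedriver"))
```

Calling `driver_path_exists(driver_path)` with your own path should return `True` once the driver is in place.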

## 3. Specify Chrome Preferences
In this step, we specify that Chrome should not open a visible GUI each time the subsequent loops are run. With the help of [`Options()`](https://chromedriver.chromium.org/capabilities), we can set our Chrome preferences.

In [None]:
# `chrome_options` : holds the specified Chrome preferences
chrome_options = Options()

# "--headless=new" : runs Chrome without opening a visible window
chrome_options.add_argument("--headless=new")

## 4. Initialize Chrome Driver

As the last step in setting up the Chrome Driver, we will need the driver path and Chrome preferences specified in Steps 2 and 3. 

To recall, we have stored these values to the following variables:
* `driver_path` : contains the file path where your Chrome WebDriver is found
* `chrome_options` : contains the chrome preferences we specified from Step 3

Using these two variables with [`webdriver.Chrome()`](https://sites.google.com/chromium.org/driver/capabilities?authuser=0), we now initialize the driver to be used later for web scraping and store it in a variable named `driver`.

In [None]:
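# `driver` : the Chrome WebDriver instance that Selenium controls during web scraping
# (Selenium 4 takes the driver path through a `Service` object and the preferences through `options`)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=chrome_options)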
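
Since the scraping loops can run for a long time, it may be worth making sure the browser is always closed at the end, even if a loop stops with an error. The following is a minimal sketch of that idea; `managed_driver` and its `make_driver` argument are hypothetical names, and the sketch only assumes that the driver object has a `quit()` method, as Selenium's WebDriver does.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield the driver built by `make_driver` and always call quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() closes the browser windows and ends the WebDriver session
        driver.quit()


# Possible usage (assumption: `chrome_options` is defined as in Step 3):
# with managed_driver(lambda: webdriver.Chrome(options=chrome_options)) as driver:
#     driver.get("https://www.gmanetwork.com/news/archives/topstories/")
```

The notebook itself keeps a single `driver` variable alive across cells, so this pattern fits best when the scraping steps are moved into one function or script.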

# Web Scraping News Sites

This section is divided into two major parts: (1) `Web Scraping GMA News` and (2) `Web Scraping Rappler`. Both will be performed with the help of **Selenium**.

## **Part 1.1.** `GMA News` Data
GMA is widely known in the Philippines for **disseminating news** and **producing entertainment shows**.
In a Reuters study in 2021, GMA Network was found to have the **highest news brand trust score** (Gonzales, 2021). The network also ranked **first in having the widest reach online** through their site GMA News Online (Chua, 2023).

### Web Scraping GMA News Pages
This section will demonstrate the process of web scraping the news articles from the GMA News site. Specifically, the **GMA News page containing a list of its articles** will be web scraped.

The URL below will be scraped. Notably, the page needs to be scrolled to load subsequent pages. All the articles can be found on this page; thus, we start by getting the **links of each article** first.

In [None]:
# `gma_url` : handles the URL of the GMA News site page that contains the articles
gma_url = "https://www.gmanetwork.com/news/archives/topstories/"

Next, the variables declared in the cell below will be used in the web scraping process. Kindly refer to the comments on the purpose of each variable.

In [None]:
# `counter` : tracks the number of times the listing page has been scanned and scrolled
#           : will be used to limit the scraping (the loop stops when `counter` reaches 1001)
# `gma_master` : contains the GMA news article links that will be retrieved through scraping
# `pause` : holds the delay in seconds between requests during the whole process of web scraping

counter = 1
gma_master = []
pause = 3

Now, we will use `.get()` to load the full GMA News site page. To check what the function actually does, you may *uncomment and run the cell below*.

In [None]:
# driver.get?

To specify the page we want to load, we pass the GMA News URL set in one of the previous cells to `.get()`.

In [None]:
# `gma_url` : handles the URL of the GMA News site page that contains the articles
driver.get(gma_url)

In the following code, [`XPath`](https://www.guru99.com/xpath-selenium.html) is used to **locate and obtain the elements from the HTML page**. The loop below will be run as the site scrolls.

* [`.find_elements()`](https://selenium-python.readthedocs.io/locating-elements.html) : for locating multiple elements (in this case, articles)
* [`.get_attribute()`](https://selenium-python.readthedocs.io/api.html?highlight=.get_attribute#selenium.webdriver.remote.webelement.WebElement.get_attribute) : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)
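
The link collection step can also be sketched on its own. This is a minimal sketch with a hypothetical helper, `collect_new_links`; it keeps a `set` next to the list so that checking for duplicates stays fast as the list grows, and it only assumes that each element offers `.get_attribute("href")`.

```python
def collect_new_links(elements, master, seen):
    """Append unseen href values from `elements` to `master`; return how many were added."""
    added = 0
    for element in elements:
        link = element.get_attribute("href")
        # skip anchors without an href and links that were collected earlier
        if link and link not in seen:
            seen.add(link)
            master.append(link)
            added += 1
    return added
```

With Selenium, `elements` would be the list returned by `.find_elements()`, `master` the list of links, and `seen` a set that starts out empty.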


In [None]:
# `checker` : `True` until the limit is reached
# `counter` : tracks the number of times the GMA News page has been scanned and scrolled
#           : will be used to limit the scraping (the loop stops when `counter` reaches 1001)
# `gma_master` : contains the GMA news article links that will be retrieved through scraping
# `pause` : holds the delay in seconds between requests during the whole process of web scraping

checker = True

while checker:

    # retrieving the url that we are currently accessing
    url = driver.current_url
    print(url)

    # applying a 3-second delay between requests
    time.sleep(pause)

    # the `ul` contains all the stories or articles loaded on the page so far
    temp = driver.find_elements(By.XPATH, '//ul[@id="grid_thumbnail_stories"]')

    # every `a` tag with the specified class under that `ul` holds an article href
    # (the leading `.` keeps the search inside the `ul` instead of the whole page)
    for each in temp[0].find_elements(By.XPATH, './/a[@class="story_link story"]'):
        # retrieves the href attribute or the link from the tag `a` in the HTML
        link = each.get_attribute("href")
        # checks if the retrieved link already exists in the `gma_master`
        if link not in gma_master:
            # appends the link to `gma_master` if the link is not on the list yet
            gma_master.append(link)

    # for scrolling in the GMA site to load the following pages
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

    # counter increments after all article links currently on the page are collected
    counter += 1

    # checks if the limit (`counter` reaching 1001) has been reached
    if counter == 1001:
        # turns `checker` to False if it reaches the limit, stopping the web scraping loop
        checker = False
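
The loop above scrolls a fixed number of times. If the page runs out of stories earlier, the remaining scrolls only add waiting time. The following is a minimal sketch of an alternative that stops once the page height no longer grows; `scroll_until_stable` is a hypothetical helper, and the `sleep` parameter exists only so the function can be tried without real waiting.

```python
import time


def scroll_until_stable(driver, max_scrolls, pause=3, sleep=time.sleep):
    """Scroll to the bottom repeatedly; stop early once the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(pause)  # give the page time to load more stories
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so further scrolling adds nothing
        last_height = new_height
    return scrolls
```

A slow connection can make the height look stable when content is still loading, so a longer `pause` may be needed in practice.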

### Processing the Web Scraped GMA News Article Links

The links obtained in the previous code have been appended to `gma_master`. Thus, in this part, each of those article links will be scraped to obtain the article content needed.

To start, we will be creating a pandas DataFrame using [`DataFrame()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe will be created with the following columns, representing the article information we will be retrieving from each of the articles: 
1. `Link` - article link
2. `Author` - author/s of the article
3. `Content` - the whole article itself

In [None]:
# `gma_df` : a DataFrame for storing the GMA news articles and its specific details

gma_df = pd.DataFrame(columns=['Link', 'Author','Content'])

Next, the loop below is used to **get all the necessary information out of each article**. 

Since some information has <u>different XPaths per article</u>, we make use of *try and except* statements. 

All the obtained information is then transferred to the dataframe named `gma_df` which will then be used in data pre-processing later on.
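
One detail matters when handling alternative XPaths: Selenium's `find_element` raises an exception when nothing matches, while `find_elements` returns an empty list instead. A fallback over several candidate XPaths can therefore be sketched by checking for an empty result. `find_first` below is a hypothetical helper, and the default `by="xpath"` is the string value behind `By.XPATH`.

```python
def find_first(driver, xpaths, by="xpath"):
    """Return the elements for the first XPath in `xpaths` that matches anything.

    An empty list is returned when none of the XPaths match.
    """
    for xpath in xpaths:
        elements = driver.find_elements(by, xpath)
        if elements:
            return elements
    return []
```

For example, `find_first(driver, ['//div[@class="main-byline"]', '//div[@class="article-author"]'])` would try the two byline containers in order.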

In [None]:
from selenium.common.exceptions import NoSuchElementException

# `scraped_count` : counter for articles that have already been scraped
# `gma_rows` : collects one dictionary per article before building the DataFrame
scraped_count = 0
gma_rows = []

# exploring each article link that was stored in `gma_master`
for link in gma_master:
    print(scraped_count, link)

    # accesses the specific page of a GMA article
    driver.get(link)

    # applying the 3-second delay between requests
    time.sleep(pause)

    # different articles have different XPaths for the whole content of the article
    try:
        content = driver.find_element(By.XPATH, '//div[@class="story_main"]')
    except NoSuchElementException:
        content = driver.find_element(By.XPATH, '//div[@class="article-body"]')

    # keep only the non-empty text of the `p` tags directly under `content`
    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)

    # an article has several paragraphs, so they are joined into one string
    concat_pars = ' '.join(paragraphs)

    # to get the author: `find_elements` returns an empty list (it does not raise)
    # when nothing matches, so the fallback checks for an empty result
    author_elements = driver.find_elements(By.XPATH, '//div[@class="main-byline"]')
    if not author_elements:
        author_elements = driver.find_elements(By.XPATH, '//div[@class="article-author"]')
    author = [element.text for element in author_elements] or ["Author not found"]

    # increments after each article is web scraped
    scraped_count += 1

    # storing the scraped content (`DataFrame.append` was removed in pandas 2.0)
    gma_rows.append({'Link': link, 'Author': author, 'Content': concat_pars})

# building the DataFrame from all the collected rows
gma_df = pd.DataFrame(gma_rows, columns=['Link', 'Author', 'Content'])

### Saving the GMA News Article Data to CSV
With the help of pandas' [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html), we will be exporting the web scraped articles stored in `gma_df` to CSV with the filename, **`gma_dataframe.csv`**. 

In [None]:
gma_df.to_csv('gma_dataframe.csv', index=False)
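
Because each `Author` value is a Python list, `to_csv()` writes it in its string form, such as `"['A. Reporter']"` (a made-up name), and it comes back as a plain string when the CSV is read again. The following is a minimal sketch of converting such a cell back into a list; `parse_author_cell` is a hypothetical helper.

```python
import ast


def parse_author_cell(cell):
    """Convert a cell such as "['A', 'B']" back into a list of names."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # not a Python literal, so treat the cell as a single plain name
        return [cell]
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


print(parse_author_cell("['A. Reporter', 'B. Writer']"))
```

After reading the file back with pandas, the helper could be applied to the whole column with `df['Author'].apply(parse_author_cell)`.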

## **Part 1.2.** `Rappler` News Data

Rappler is said to have an **extensive audience** despite being controversial under the Duterte administration (Chua, 2023). The online news website focuses on **investigative journalism**, and has become one of the **most notable news sources** in the Philippines (Britannica, n.d.).

The same process used in web scraping the GMA News data will also be done with Rappler. Similarly, we obtain the links first before scraping each article.

### Web Scraping Rappler Pages

This section will demonstrate the process of web scraping the news articles from the Rappler site. In scraping Rappler's pages, there was a need to <u>change the last part of the URL</u> instead of scrolling (which was done on GMA's). Through the loop below, we are able to **navigate through Rappler's different pages** and use the <u>appropriate XPaths after checking the site's HTML</u>.
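
The page navigation can be pictured as a sequence of URLs that differ only in the page number at the end. The following is a minimal sketch using the URL pattern from the loop in this section; `RAPPLER_LATEST` and `rappler_page_urls` are hypothetical names introduced for illustration.

```python
RAPPLER_LATEST = "https://www.rappler.com/latest/page/{page}/"


def rappler_page_urls(first_page, last_page):
    """Yield listing-page URLs from `first_page` to `last_page`, inclusive."""
    for page in range(first_page, last_page + 1):
        yield RAPPLER_LATEST.format(page=page)


# Example usage: the first three listing pages
for url in rappler_page_urls(1, 3):
    print(url)
```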

In [None]:
# `rap_master` : contains the Rappler article links that will be retrieved through scraping
# `checker` : `True` until it reaches the quantity limit of the scraped pages (limit = 1200)
# `counter` : tracks the number of scraped Rappler pages
#           : will be used to limit the number of scraped pages (limit = 1200)
rap_master = []
checker = True
counter = 1

Similar to scraping the GMA News site, [`XPath`](https://www.guru99.com/xpath-selenium.html) is used to **locate and obtain the elements from the HTML page**. To recall, these same functions will be used to locate the links:
* [`.find_elements()`](https://selenium-python.readthedocs.io/locating-elements.html) : for locating multiple elements (in this case, articles)
* [`.get_attribute()`](https://selenium-python.readthedocs.io/api.html?highlight=.get_attribute#selenium.webdriver.remote.webelement.WebElement.get_attribute) : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)

In [None]:
# `checker` : `True` until the page limit is reached
# `counter` : tracks the Rappler page currently being scraped
#           : will be used to limit the number of scraped pages (the loop stops when `counter` reaches 1200)
#           : indicator to change the url or the site page number to be web scraped next
# `rap_master` : contains the Rappler article links that will be retrieved through scraping
# `pause` : holds the delay in seconds between requests during the whole process of web scraping

while checker:
    # accessing the Rappler page containing the list of articles to be web scraped later
    url_rap = f"https://www.rappler.com/latest/page/{counter}/"
    print('Scraping', url_rap)
    driver.get(url_rap)

    # applying the 3-second delay
    time.sleep(pause)

    # the `main` element contains all the article links
    temp = driver.find_elements(By.XPATH, '//main[@id="primary"]')

    # to get all the links in the current page through the href attribute of each tag `a`
    # (the leading `.` keeps the search inside `main` instead of the whole page)
    for each in temp[0].find_elements(By.XPATH, './/article//h2//a'):
        link = each.get_attribute("href")
        if link not in rap_master:
            rap_master.append(link)

    # applying the 3-second delay
    time.sleep(pause)

    # counter increments after a Rappler page is web scraped
    # this is also the indicator to change the url or the page to be web scraped next
    counter += 1

    # checks if the limit (`counter` reaching 1200) has been reached
    if counter == 1200:
        # turns `checker` to False if it reaches the limit, stopping the web scraping loop
        checker = False
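
The loop above always visits a fixed number of pages. If the site has fewer pages than expected, a page that adds no new links is a reasonable signal to stop. The following is a minimal sketch of that idea; `scrape_listing_pages` is a hypothetical helper, and `fetch_links(page)` stands for any function that returns the links found on one listing page.

```python
def scrape_listing_pages(fetch_links, max_pages):
    """Collect unique links page by page; stop early when a page adds nothing new."""
    master, seen = [], set()
    for page in range(1, max_pages + 1):
        added = 0
        for link in fetch_links(page):
            if link and link not in seen:
                seen.add(link)
                master.append(link)
                added += 1
        if added == 0:
            break  # likely past the last page, or the page layout has changed
    return master
```

With Selenium, `fetch_links` would load the page with `driver.get()` and return the `href` values located through XPath, as in the loop above.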

### Processing the Web Scraped Rappler Article Links

Similar to the processing of GMA News' article links, we start by creating the DataFrame that will contain the necessary information from each article from Rappler. This time, we will store the article information to the `rap_df` pandas dataframe.

In [None]:
# `rap_df` : a DataFrame for storing the Rappler articles and its specific details

rap_df = pd.DataFrame(columns=['Link', 'Author','Content'])

From this, the loop below will be used **to extract the necessary information from each article**. 

Since some information has <u>distinct XPaths per article</u>, we will be using *try and except* statements for a smooth process of web scraping. 

After all of the web scraping processes per article, the gathered article information will then be stored to `rap_df` and will be used in data pre-processing later on.

In [None]:
# `scraped_count` : counter for articles that have already been scraped
# `rap_rows` : collects one dictionary per article before building the DataFrame
scraped_count = 0
rap_rows = []

# exploring each article link that was stored in `rap_master`
for link in rap_master:
    print(scraped_count, link)

    # accesses the specific page of a Rappler article
    driver.get(link)

    # applying the 3-second delay between requests
    time.sleep(pause)

    # for main content
    content = driver.find_element(By.XPATH, '//div[@class="post-single__content entry-content"]')

    # keep only the non-empty text of the `p` tags directly under `content`
    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)

    # again, the paragraphs are joined since they are not contained in a single tag
    concat_pars = ' '.join(paragraphs)

    # to obtain the author: `find_elements` returns an empty list (it does not raise)
    # when nothing matches, so an empty result falls back to a placeholder
    author_elements = driver.find_elements(By.XPATH, '//a[@class="post-single__author"]')
    author = [element.text for element in author_elements] or ["Author not found"]

    # increments after each article is web scraped
    scraped_count += 1

    # storing the obtained content (`DataFrame.append` was removed in pandas 2.0)
    rap_rows.append({'Link': link, 'Author': author, 'Content': concat_pars})

# building the DataFrame from all the collected rows
rap_df = pd.DataFrame(rap_rows, columns=['Link', 'Author', 'Content'])

### Saving the Rappler News Article Data to CSV
Using pandas' [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html), we will export the web scraped articles stored in `rap_df` to CSV with **`rap_dataframe.csv`** as its filename. 

In [None]:
rap_df.to_csv('rap_dataframe.csv', index=False)

# References

Britannica. (n.d.). *Rappler*. https://www.britannica.com/topic/Rappler

Chua, Y. (2021). *Pandemic increases trust in news among Filipinos - digital news report 2021*. https://news.abs-cbn.com/spotlight/06/23/21/internet-use-philippines-digital-news-2021

Chua, Y. (2023). *Philippines*. https://reutersinstitute.politics.ox.ac.uk/digital-news-report/2023/philippines

Gonzales, G. (2021). *Trust in news from social media decreases, ABS-CBN reach tumbles in 2021 Reuters study*. https://www.rappler.com/technology/social-media-news-trust-decreases-filipinos-2021-reuters-digital-news-report/