# **Data Collection Part 1**: Web Scraping <u>Real News</u>

This notebook demonstrates the first part of our data collection: web scraping real news from well-known news sites in the Philippines that are generally considered reliable.

The **news sites involved** in collecting real news are the following: 
1. [`GMA News`](https://www.gmanetwork.com/news/topstories/)
2. [`Rappler`](https://www.rappler.com/)

# Importing Libraries
To start, we import the libraries that help us perform web scraping and data processing.

## Basic Libraries
*  [`time`](https://docs.python.org/3/library/time.html): provides functions for handling time-related tasks 
> this allows us to add pauses to the loops that we will run later on

*   [`pandas`](https://pandas.pydata.org/) : has functions for data analysis and manipulation
> this allows us to obtain and organize all scraped data in a format that can be used for Exploratory Data Analysis (EDA) and for creating a model for fake news detection

In [None]:
import time
import pandas as pd

## Web Scraping Library: `Selenium`
> The Selenium library will be used for web scraping GMA and Rappler since navigation through multiple pages will be performed on both sites.

*   `from selenium import webdriver` enables automation of the web browser
*   `from selenium.webdriver.common.keys import Keys` enables simulation of keyboard keys
*   `from selenium.webdriver.common.by import By` helps with locating elements in the webpage through XPath
*   `from selenium.webdriver.chrome.options import Options` lets us specify preferences when initializing the WebDriver; here, it keeps browser windows from opening while the loops run

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Preparations for Web Scraping

## Preparing [`Chrome WebDriver`](https://chromedriver.chromium.org/downloads) for Selenium Web Scraping

Since the needed libraries have already been imported, we will now **set up our Chrome WebDriver** before we start web scraping using the Selenium library. 

* [`Chrome WebDriver`](https://chromedriver.chromium.org/getting-started) is the driver that lets Selenium control the Chrome window during web scraping.

### 1. To start setting up the driver, download the driver from the [`ChromeDriver site`](https://chromedriver.chromium.org/downloads). 
It is recommended to put the downloaded file in a location that is easy to find, or in the same folder as the project, so that the next step is easier.

### 2. Set the driver path 
This step requires you to specify the driver path on your own device and store it in a variable; here, it is stored in `driver_path`. The path is used to initialize the Chrome WebDriver in a later step of the setup.

In [None]:
# `driver_path` : contains the path where the WebDriver is found
driver_path="/Users/beatricebanzon/Desktop/dlsu/col/2/T3 22.23/DATA102/chromedriver_mac_arm64/chromedriver.exe"

### 3. Specify Chrome Preferences
In this step, we disable the visible browser window (GUI) so that none opens each time the subsequent loops are run. [`Options()`](https://chromedriver.chromium.org/capabilities) lets us specify these Chrome preferences.

In [None]:
# `chrome_options` : holds the specified Chrome preferences
chrome_options = Options()

# "--headless=new" : runs Chrome without opening a visible window
chrome_options.add_argument("--headless=new")

### 4. Initialize Chrome Driver

As the last step in setting up the Chrome Driver, we need the driver path and Chrome preferences we specified in Steps 2 and 3. 

To recall, we have stored these values to the following variables:
* `driver_path` : contains the file path where your Chrome WebDriver is found
* `chrome_options` : contains the chrome preferences we specified from Step 3

By passing these two variables to [`webdriver.Chrome()`](https://sites.google.com/chromium.org/driver/capabilities?authuser=0), we initialize the driver to be used later for web scraping and store it in a variable named `driver`.

In [None]:
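# `driver`: the Chrome WebDriver instance that Selenium controls during web scraping
# Selenium 4 takes the driver path through a Service object and the preferences through `options`
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=chrome_options)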
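
The same `driver` stays open for every loop that follows, so an error in a later cell can leave a headless Chrome process running. A minimal sketch, assuming you want the browser closed even when a cell raises, is shown here. The helper name `managed_driver` is hypothetical and not part of Selenium; `quit()` is the WebDriver method that ends the session.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver):
    """Yield the driver and always call quit() afterwards (hypothetical helper)."""
    try:
        yield driver
    finally:
        # quit() ends the browser session and stops the driver process
        driver.quit()


# Usage sketch (assumes `driver` was created as in Step 4):
# with managed_driver(driver) as d:
#     d.get("https://www.rappler.com/")
```

Since this notebook reuses `driver` across many cells, calling `driver.quit()` once in a final cell may be the simpler choice.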

## Setting the Delay Between Scraping Requests
The `pause` variable holds the <u>**delay in seconds** between requests</u> during the whole web scraping process.

In [None]:
pause = 3
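
To get a feel for what this delay costs, the rough arithmetic below estimates the minimum time spent sleeping in the two link-collection loops of this notebook. It uses only numbers that appear in the notebook (the loop limits `1001` and `1200`, and the extra 5-second wait after each GMA scroll); page-load time is not included, so real runs will take longer.

```python
pause = 3            # seconds, as set above
gma_scroll_wait = 5  # the extra time.sleep(5) after each scroll in the GMA loop

gma_iterations = 1001 - 1      # counter starts at 1 and the loop stops at 1001
rappler_iterations = 1200 - 1  # counter starts at 1 and the loop stops at 1200

gma_seconds = gma_iterations * (pause + gma_scroll_wait)
rappler_seconds = rappler_iterations * (pause + pause)  # the Rappler loop sleeps twice per page

print(f"GMA link pages:     {gma_seconds} s (~{gma_seconds / 3600:.1f} h)")
print(f"Rappler link pages: {rappler_seconds} s (~{rappler_seconds / 3600:.1f} h)")
```

Both loops come out at roughly two hours of waiting each, before any article is opened.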

# Web Scraping News Sites

This section is divided into two major parts which are: (1) Web Scraping GMA News and (2) Web Scraping Rappler. Both sections of web scraping will be performed with the help of Selenium.

## Web Scraping `GMA News`

### GMA News Pages
> The pages containing the links of GMA's articles will be scraped.

* gma_url : url of GMA containing the articles
* counter : used to limit the number of scraped pages
* gma_master : container for the links



The URL below will be scraped. Notably, the page needs scrolling to load subsequent pages. All the articles are listed on this page; thus, we start by getting the link of each article.
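
Because the page grows as it is scrolled, one way to tell that nothing more is loading is to compare the page height before and after each scroll. The sketch below is an illustration under that assumption, not the method used in this notebook; the helper name `scroll_until_stable` and its parameters are hypothetical. A slow connection could make the height look stable too early, so the wait time may need tuning.

```python
import time


def scroll_until_stable(driver, wait_seconds=5, max_scrolls=1000):
    """Scroll to the bottom until the page height stops growing (hypothetical helper).

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(wait_seconds)  # give the site time to append more stories
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end was reached
        last_height = new_height
        loaded += 1
    return loaded
```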

In [None]:
gma_url = "https://www.gmanetwork.com/news/archives/topstories/"
counter = 1
gma_master = []

In the following code, XPath is used to locate and obtain the elements from the HTML page. The loop below will be run as the site scrolls.

* '.find_elements' : for locating multiple elements (in this case, articles)
* '.get_attribute' : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)


In [None]:
driver.get(gma_url)

In [None]:
checker = True
while checker:
    # show the URL of the page currently loaded in the browser
    url = driver.current_url
    print(url)

    time.sleep(pause)

    # the ul contains all the stories or articles found in each page
    temp = driver.find_elements(By.XPATH, '//ul[@id="grid_thumbnail_stories"]')

    # every a tag with the specified class holds the href;
    # note that an XPath starting with // searches the whole page, not only under temp[0]
    for each in temp[0].find_elements(By.XPATH, '//a[@class="story_link story"]'):
        link = each.get_attribute("href")
        if link not in gma_master:
            gma_master.append(link)

    # the GMA site needs scrolling to load the following pages
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

    counter += 1

    # stop after 1000 scrolls
    if counter == 1001:
        break
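
The check `link not in gma_master` scans the whole list every time, which gets slower as the list grows. A small sketch of an alternative, assuming a `set` is kept alongside the list to remember what has been seen, is shown below. The helper name `add_new_links` is hypothetical.

```python
def add_new_links(links, master, seen):
    """Append unseen links to `master`, keeping first-seen order (hypothetical helper)."""
    added = 0
    for link in links:
        # get_attribute("href") can return None, so skip empty values
        if link and link not in seen:
            seen.add(link)
            master.append(link)
            added += 1
    return added


master, seen = [], set()
add_new_links(["a", "b", "a", None], master, seen)
add_new_links(["b", "c"], master, seen)
print(master)  # ['a', 'b', 'c']
```

The list keeps the order in which links were found, while the set makes each membership check fast.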

### GMA News Articles
The links obtained in the previous code have been appended to gma_master. Thus, in this part, each of those links will be scraped to obtain the content needed.

In [None]:
# creating the DataFrame with the needed content
gma_df = pd.DataFrame(columns=['Link', 'Author','Content'])

The loop below is used to get all the necessary information out of each article. Since some information has different XPaths per article, we make use of `try` and `except` statements. All the obtained information is then transferred to the DataFrame named `gma_df`, which will then be used in data pre-processing.
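
One detail worth keeping in mind when choosing between fallback XPaths: `find_element` raises `NoSuchElementException` when nothing matches, while `find_elements` returns an empty list. A minimal sketch of a fallback helper built on the second behaviour is shown below. The name `find_first` is hypothetical, and the string `"xpath"` is the value of Selenium's `By.XPATH`, used here so the snippet has no imports.

```python
def find_first(driver, xpaths):
    """Return the elements for the first XPath that matches anything, else [] (hypothetical helper)."""
    for xpath in xpaths:
        # "xpath" is the string value of selenium's By.XPATH
        elements = driver.find_elements("xpath", xpath)
        if elements:
            return elements
    return []


# Usage sketch with the two GMA byline XPaths from this notebook:
# author_elements = find_first(driver, ['//div[@class="main-byline"]',
#                                       '//div[@class="article-author"]'])
# author = [e.text for e in author_elements] or ["Author not found"]
```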

In [None]:
from selenium.common.exceptions import NoSuchElementException

# to know how many articles have already been scraped
scraped_count = 0

for link in gma_master:
    print(scraped_count, link)
    driver.get(link)

    time.sleep(pause)

    # different articles have different XPaths for the whole content of the article
    try:
        content = driver.find_element(By.XPATH, '//div[@class="story_main"]')
    except NoSuchElementException:
        try:
            content = driver.find_element(By.XPATH, '//div[@class="article-body"]')
        except NoSuchElementException:
            # neither layout matched, so skip this link instead of stopping the run
            continue

    # keep only the text of the p tags directly under content
    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)
    # an article has several paragraphs, so they are joined into one string
    concat_pars = ' '.join(paragraphs)

    # to get the author: find_elements returns an empty list (it does not raise)
    # when nothing matches, so the fallback is checked with `if not`
    author_elements = driver.find_elements(By.XPATH, '//div[@class="main-byline"]')
    if not author_elements:
        author_elements = driver.find_elements(By.XPATH, '//div[@class="article-author"]')
    author = [element.text for element in author_elements] or ["Author not found"]

    scraped_count += 1

    # appending all the scraped content into the DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    new_row = pd.DataFrame([{'Link': link, 'Author': author, 'Content': concat_pars}])
    gma_df = pd.concat([gma_df, new_row], ignore_index=True)
    # storing everything into a CSV file to prevent loss
    gma_df.to_csv('gma_dataframe.csv', index=False)
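
The loop above rewrites the whole CSV file after every article, which becomes slower as the DataFrame grows. A sketch of one alternative, assuming it is acceptable to append a single row per article with the standard `csv` module, is shown below. The helper name `append_article` and its parameters are hypothetical.

```python
import csv
import os


def append_article(csv_path, row, fieldnames=("Link", "Author", "Content")):
    """Append one article to a CSV, writing the header only once (hypothetical helper)."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Usage sketch inside the loop (the file name is only an example):
# append_article('gma_dataframe.csv',
#                {'Link': link, 'Author': author, 'Content': concat_pars})
```

The file can still be loaded afterwards with `pd.read_csv` for pre-processing.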

## Web Scraping `Rappler`

The same method used for GMA is applied to Rappler: we obtain the links first before scraping each article.

### Rappler Pages
> In scraping Rappler's pages, we change the last part of the URL instead of scrolling (as was done for GMA). Through the loop below, we navigate through Rappler's different pages and use the appropriate XPaths after checking the site's HTML.

The following variables will be used:

* rap_master : container for the links
* checker : will be used for the following loops to run
* counter : used to limit the number of scraped pages

In [None]:
rap_master = []
checker = True
counter = 1

The following functions will be used to locate the links:
* '.find_elements' : for locating multiple elements (in this case, articles)
* '.get_attribute' : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)

In [None]:
while checker:
    url_rap=f"https://www.rappler.com/latest/page/{counter}/"
    print ('Scraping', url_rap)
    driver.get(url_rap)

    time.sleep(pause)

    # contains all the articles
    temp = driver.find_elements(By.XPATH, '//main[@id="primary"]')

    # to get all the links in the current page
    for each in temp[0].find_elements(By.XPATH, '//article//h2//a'):
        link = each.get_attribute("href")
        if link not in rap_master:
            rap_master.append(link)

    time.sleep(pause)
    # necessary to change the url
    counter+=1

    if counter==1200:
        break
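
The loop above always visits a fixed number of pages. A sketch of a variation, assuming that a page with no article links means the pagination has run out, is shown below. The helper name `collect_paginated` and the `get_page_links` callback are hypothetical; in this notebook, the callback would wrap the `driver.get` and `find_elements` calls.

```python
def collect_paginated(get_page_links, max_pages):
    """Collect unique links page by page, stopping at the first empty page (hypothetical helper).

    `get_page_links(page_number)` is assumed to return the list of hrefs on that page.
    """
    master, seen = [], set()
    for page in range(1, max_pages + 1):
        links = get_page_links(page)
        if not links:
            break  # assume an empty page means we ran past the last page
        for link in links:
            if link not in seen:
                seen.add(link)
                master.append(link)
    return master


# Stand-in for the Selenium call, for illustration only
fake_site = {1: ["a", "b"], 2: ["b", "c"]}
print(collect_paginated(lambda page: fake_site.get(page, []), max_pages=1199))
```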

### Rappler Articles

We start by creating the DataFrame that will contain all the content from each Rappler article.

In [None]:
rap_df = pd.DataFrame(columns=['Link', 'Author','Content'])

The loop below is used to get all the necessary information out of each article. Since some information has different XPaths per article, we make use of `try` and `except` statements. All the obtained information is then transferred to the DataFrame named `rap_df`, which will then be used in data pre-processing.

In [None]:
from selenium.common.exceptions import NoSuchElementException

# to know how many articles have already been scraped
scraped_count = 0

for link in rap_master:
    print(scraped_count, link)
    driver.get(link)

    time.sleep(pause)

    # for main content
    try:
        content = driver.find_element(By.XPATH, '//div[@class="post-single__content entry-content"]')
    except NoSuchElementException:
        # the page does not follow the usual article layout, so skip it
        continue

    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)
    # again, the paragraphs have to be joined since they are not contained in a single tag
    concat_pars = ' '.join(paragraphs)

    # to obtain the author, the following XPath is used;
    # find_elements returns an empty list (it does not raise) when nothing matches
    author_elements = driver.find_elements(By.XPATH, '//a[@class="post-single__author"]')
    author = [element.text for element in author_elements] or ["Author not found"]

    scraped_count += 1

    # appending all the obtained content to the DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    new_row = pd.DataFrame([{'Link': link, 'Author': author, 'Content': concat_pars}])
    rap_df = pd.concat([rap_df, new_row], ignore_index=True)
    # storing to prevent loss
    rap_df.to_csv('rap_dataframe.csv', index=False)