# Importing Relevant Libraries

Before scraping real news, we import a few libraries that let us scrape and organize the content obtained from the chosen sites, in this case GMA and Rappler.

## Basic Libraries
We start by importing libraries that are not used for scraping itself but make the task easier.

*   `time`: provides functions for handling time-related tasks ([Programiz](https://www.programiz.com/python-programming/time), n.d.)
> this lets us pause between iterations of the loops that we will run later on

*   `pandas`: has functions for data analysis and manipulation ([pandas](https://pandas.pydata.org/), n.d.)
> this lets us collect all scraped data and organize it in a format that can be used for EDA and, later on, for creating the model




In [None]:
import time
import pandas as pd

## Selenium
> The following libraries are needed to scrape GMA and Rappler. Selenium is used since we navigate through multiple pages in both sites.

*   `from selenium import webdriver` enables automation of the web browser
*   `from selenium.webdriver.common.keys import Keys` enables simulation of keyboard keys
*   `from selenium.webdriver.common.by import By` helps with locating elements in the webpage, here through XPath
*   `from selenium.webdriver.chrome.options import Options` lets us specify preferences when initializing the WebDriver; here it prevents browser windows from opening while the loops run

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Preparation for Scraping
With the libraries imported, scraping can begin once a few preliminaries are in place. The following variables are needed:

* driver_path : contains the path where the WebDriver is found
* chrome_options : added so that no browser window is displayed while the code is running

To run the subsequent code, the user has to download the Chrome WebDriver and set driver_path to its location on their own machine.

In [None]:
# insert your own driver path below:
driver_path="/Users/beatricebanzon/Desktop/dlsu/col/2/T3 22.23/DATA102/chromedriver_mac_arm64/chromedriver.exe"
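
Because every later cell depends on this path, it can help to confirm that a file actually exists there before starting the browser. The helper below is only a sketch: `check_driver_path` is a hypothetical name that is not part of Selenium or of the original notebook.

```python
from pathlib import Path


def check_driver_path(path_str):
    """Return the path as a Path object, or raise if no file exists there."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No WebDriver found at: {path}")
    return path


# possible usage in the notebook (assumes driver_path was set in the cell above):
# check_driver_path(driver_path)
```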

As mentioned earlier, the `Options` class is used to specify preferences. Here, we run Chrome in headless mode so that no visible browser window opens each time the subsequent loops are run.

In [None]:
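from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless=new")
# Selenium 4 takes the driver path through a Service object and the preferences through `options`
driver = webdriver.Chrome(service=Service(driver_path), options=chrome_options)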

Two more variables are defined, with the following purposes:
* pause : delay in seconds between page loads
* checker : flag that keeps the following loops running

In [None]:
pause = 3
checker = True

## Scraping GMA

### GMA Pages
> The pages containing the links of GMA's articles will be scraped.

* gma_url : URL of the GMA page that lists the articles
* counter : used to limit the number of scraped pages
* gma_master : container for the links



The URL below will be scraped. Notably, the page has to be scrolled to load subsequent pages. Since all the articles are listed on this page, we start by getting the link of each article.

In [None]:
gma_url = "https://www.gmanetwork.com/news/archives/topstories/"
counter = 1
gma_master = []

In the following code, XPath is used to locate and obtain the elements from the HTML page. The loop below will be run as the site scrolls.

* `.find_elements` : for locating multiple elements (in this case, articles)
* `.get_attribute` : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)
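
One detail worth illustrating is the scope of the XPath. In Selenium, an XPath that begins with `//` searches the whole document even when it is called on an element, while `.//` restricts the search to that element's descendants. The sketch below shows the relative form using Python's built-in `xml.etree.ElementTree` rather than Selenium, so it runs without a browser. The markup is hypothetical and only mimics the structure targeted in this section.

```python
import xml.etree.ElementTree as ET

# hypothetical, simplified markup; the real GMA page is more complex
html = """
<body>
  <a class="story_link story" href="/outside">outside the list</a>
  <ul id="grid_thumbnail_stories">
    <li><a class="story_link story" href="/news/1">first</a></li>
    <li><a class="story_link story" href="/news/2">second</a></li>
  </ul>
</body>
""".strip()

root = ET.fromstring(html)
# locate the container first, as is done for the ul in the loop
story_list = root.find(".//ul[@id='grid_thumbnail_stories']")
# './/a' searches only the descendants of story_list
anchors = story_list.findall(".//a[@class='story_link story']")
# '.get' plays the role that '.get_attribute' plays in Selenium
links = [a.get("href") for a in anchors]
print(links)
```

The link placed outside the list is not collected, which is the behaviour one would want when only the stories inside the container are of interest.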


In [None]:
driver.get(gma_url)

In [None]:
while checker:
    print('Scroll', counter, driver.current_url)

    time.sleep(pause)

    # the ul contains all the stories or articles found in each page
    temp = driver.find_elements(By.XPATH, '//ul[@id="grid_thumbnail_stories"]')

    # the a tags with the specified class under that ul contain the href;
    # the leading './/' keeps the search inside the ul instead of the whole document
    if temp:
        for each in temp[0].find_elements(By.XPATH, './/a[@class="story_link story"]'):
            link = each.get_attribute("href")
            if link and link not in gma_master:
                gma_master.append(link)

    # the GMA site needs scrolling to load the following pages
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

    counter += 1

    # stop after 1,000 scrolls
    if counter == 1001:
        break
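
The loop above always scrolls a fixed number of times. A possible refinement, sketched here under the assumption that the page height stops growing once no more stories load, is to stop early when a scroll changes nothing. `scroll_until_stable` is a hypothetical helper, and the two callables it receives stand in for the `driver.execute_script` calls.

```python
import time


def scroll_until_stable(get_height, scroll_down, max_scrolls=1000, wait=0.0):
    """Scroll until the page height stops growing or max_scrolls is reached.

    get_height and scroll_down are callables supplied by the caller, so the
    logic can be tried without a browser. Returns the number of scrolls made.
    """
    last_height = get_height()
    for scrolls in range(1, max_scrolls + 1):
        scroll_down()
        time.sleep(wait)  # give the page time to load new stories
        new_height = get_height()
        if new_height == last_height:
            return scrolls  # nothing new was loaded, so stop early
        last_height = new_height
    return max_scrolls


# possible usage with the driver from this notebook:
# scroll_until_stable(
#     get_height=lambda: driver.execute_script("return document.body.scrollHeight"),
#     scroll_down=lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
#     wait=5,
# )
```

In this form the links would be collected after scrolling finishes, or inside `scroll_down`, depending on how much of the page stays loaded.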

### GMA Articles
The links obtained in the previous code have been appended to gma_master. Thus, in this part, each of those links will be scraped to obtain the content needed.

In [None]:
# creating the DataFrame with the needed content
gma_df = pd.DataFrame(columns=['Link', 'Author','Content'])

The loop below gets all the necessary information out of each article. Since the XPaths for some information differ between articles, alternative XPaths are tried in turn, with a fallback value when none of them matches. All the obtained information is then transferred to the DataFrame named gma_df, which will be used in data pre-processing.

In [None]:
from selenium.common.exceptions import NoSuchElementException

# to know how many articles have already been scraped
scraped_count = 0
# rows are collected in a list because DataFrame.append was removed in pandas 2.0
gma_rows = []

for link in gma_master:
    print(scraped_count, link)
    driver.get(link)

    time.sleep(pause)

    # different articles have different XPaths for the whole content of the article
    try:
        content = driver.find_element(By.XPATH, '//div[@class="story_main"]')
    except NoSuchElementException:
        content = driver.find_element(By.XPATH, '//div[@class="article-body"]')

    # get only the text of the p tags directly under content
    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)
    # since there are several paragraphs in one article, they are joined into one string
    concat_pars = ' '.join(paragraphs)

    # to get the author: find_elements returns an empty list (it does not raise)
    # when nothing matches, so the second XPath is tried only if the first finds nothing
    author_elements = (driver.find_elements(By.XPATH, '//div[@class="main-byline"]')
                       or driver.find_elements(By.XPATH, '//div[@class="article-author"]'))
    author = [element.text for element in author_elements] or ["Author not found"]

    scraped_count += 1

    # appending all the scraped content into the DataFrame
    gma_rows.append({'Link': link, 'Author': author, 'Content': concat_pars})
    gma_df = pd.DataFrame(gma_rows, columns=['Link', 'Author', 'Content'])
    # storing everything into a CSV file to prevent loss
    gma_df.to_csv('gma_dataframe.csv', index=False)
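
Rewriting the whole CSV file after every article protects against data loss, but each save takes longer as the data grows. One alternative, sketched here with the standard `csv` module, is to append a single row per article. `append_row` is a hypothetical helper, and the column names are assumed to match those of gma_df.

```python
import csv
import os


def append_row(csv_path, row, fieldnames=("Link", "Author", "Content")):
    """Append one article to csv_path, writing the header only for a new file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# possible usage inside the article loop (the author list is joined into one string):
# append_row('gma_dataframe.csv',
#            {'Link': link, 'Author': '; '.join(author), 'Content': concat_pars})
```

The resulting file can still be loaded afterwards with `pd.read_csv` for pre-processing.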

## Scraping Rappler

The same method used for GMA is applied to Rappler: we obtain the links first, then scrape each article.

### Rappler Pages
> Rappler's listing pages are reached by changing the page number at the end of the URL instead of scrolling (as was done for GMA). The loop below navigates through these pages and applies XPaths chosen after inspecting the site's HTML.

The following variables will be used:

* rap_master : container for the links
* checker : will be used for the following loops to run
* counter : used to limit the number of scraped pages

In [None]:
rap_master = []
checker = True
counter = 1

The following functions will be used to locate the links:
* `.find_elements` : for locating multiple elements (in this case, articles)
* `.get_attribute` : for retrieving a specific attribute inside the element (in this case, for getting the link found in each element)

In [None]:
while checker:
    url_rap = f"https://www.rappler.com/latest/page/{counter}/"
    print('Scraping', url_rap)
    driver.get(url_rap)

    time.sleep(pause)

    # contains all the articles
    temp = driver.find_elements(By.XPATH, '//main[@id="primary"]')

    # to get all the links in the current page; the leading './/' keeps the
    # search inside the main element instead of the whole document
    if temp:
        for each in temp[0].find_elements(By.XPATH, './/article//h2//a'):
            link = each.get_attribute("href")
            if link and link not in rap_master:
                rap_master.append(link)

    time.sleep(pause)
    # necessary to change the url
    counter += 1

    # stop after page 1199
    if counter == 1200:
        break
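
The URL pattern used in the loop can also be expressed as a small generator, which makes the range of visited pages explicit. This is only a sketch: `rappler_page_urls` is a hypothetical name, and the default bounds mirror the loop above, which visits pages 1 to 1199.

```python
def rappler_page_urls(first_page=1, last_page=1199):
    """Yield the listing URLs from first_page to last_page, inclusive."""
    for page in range(first_page, last_page + 1):
        yield f"https://www.rappler.com/latest/page/{page}/"


# preview the first three URLs without opening a browser
for url in rappler_page_urls(1, 3):
    print(url)
```

A loop written as `for url_rap in rappler_page_urls():` would then need neither the counter nor the checker flag.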

### Rappler Articles

We start by creating the DataFrame that will contain the content of each Rappler article.

In [None]:
rap_df = pd.DataFrame(columns=['Link', 'Author','Content'])

The loop below gets all the necessary information out of each article. Since some information may be missing from an article, a fallback value is stored when an element is not found. All the obtained information is then transferred to the DataFrame named rap_df, which will be used in data pre-processing.

In [None]:
# to know how many articles have already been scraped
scraped_count = 0
# rows are collected in a list because DataFrame.append was removed in pandas 2.0
rap_rows = []

for link in rap_master:
    print(scraped_count, link)
    driver.get(link)

    time.sleep(pause)

    # for main content
    content = driver.find_element(By.XPATH, '//div[@class="post-single__content entry-content"]')
    paragraphs = []
    for par in content.find_elements(By.XPATH, 'p'):
        text = par.text.strip()
        if text:
            paragraphs.append(text)
    # the article text is spread over several p tags, so the paragraphs are joined
    concat_pars = ' '.join(paragraphs)

    # to obtain the author, the following XPath is used; find_elements returns
    # an empty list (it does not raise) when nothing matches
    author_elements = driver.find_elements(By.XPATH, '//a[@class="post-single__author"]')
    author = [element.text for element in author_elements] or ["Author not found"]

    scraped_count += 1

    # appending all the obtained content to the DataFrame
    rap_rows.append({'Link': link, 'Author': author, 'Content': concat_pars})
    rap_df = pd.DataFrame(rap_rows, columns=['Link', 'Author', 'Content'])
    # storing to prevent loss
    rap_df.to_csv('rap_dataframe.csv', index=False)
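
Both article loops follow the same pattern: try one lookup, fall back to another, and use a default when nothing is found. That pattern can be sketched as a small function. `first_non_empty` is a hypothetical helper, not part of Selenium, and it relies on `find_elements` returning an empty list when nothing matches.

```python
def first_non_empty(finders, default):
    """Return the first non-empty result from a list of zero-argument callables.

    Each finder stands in for one lookup, such as a find_elements call with
    one XPath. If every finder returns an empty result, default is returned.
    """
    for find in finders:
        result = find()
        if result:
            return result
    return default


# possible usage for the GMA author lookup:
# author = first_non_empty(
#     [lambda: [e.text for e in driver.find_elements(By.XPATH, '//div[@class="main-byline"]')],
#      lambda: [e.text for e in driver.find_elements(By.XPATH, '//div[@class="article-author"]')]],
#     default=["Author not found"],
# )
```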