# **Data Collection Part 2**: Web Scraping <u>Fake News</u>

This notebook demonstrates the second part of this project's data collection: web scraping fake news articles from news sites in the Philippines.

The **news sites involved** in collecting fake news are the following:
1. [`Ako'y Pilipino`](http://akoy-pilipino.blogspot.com/)
2. [`Maharlika News`](https://www.maharlikanews.com/)

# Importing Libraries
To start, we import the libraries needed for web scraping and data processing.

## Basic Libraries
*  [`requests`](https://pypi.org/project/requests/): has functions for HTTP requests 
> allows us to make requests to a web page and access its contents

*   [`pandas`](https://pandas.pydata.org/) : has functions for data analysis and manipulation
> this allows us to obtain and organize all scraped data in a format that can be used for Exploratory Data Analysis (EDA) and for creating a model for fake news detection

In [None]:
import requests
import pandas as pd

## Web Scraping Library: [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

> The [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library will be used for web scraping fake news from Ako'y Pilipino and Maharlika News, since we only scrape one page at a time on both sites.

In [None]:
from bs4 import BeautifulSoup
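
As a minimal sketch of how `BeautifulSoup` locates elements, the snippet below parses a small hand-written HTML string and collects the text of each `<h2>` found inside an `<article>` tag. The markup and the `headline` class are assumptions made for illustration; they are not taken from either site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; the real sites' markup differs.
sample_html = """
<article><h2 class="headline"> First headline </h2></article>
<article><h2 class="headline">Second headline</h2></article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching tag; find() returns the first match (or None).
headlines = [
    article.find("h2").get_text(strip=True)
    for article in soup.find_all("article")
]
print(headlines)
```

The scraping functions in this notebook follow the same pattern: `find_all()` to list the articles on a page, then `find()` to pull individual fields out of each one.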

# Web Scraping Fake News Sites

This section has two parts: (1) `Web Scraping Ako'y Pilipino News` and (2) `Web Scraping Maharlika News`. Both are carried out with the **BeautifulSoup** library.

## Part 2.1. Web Scraping Ako'y Pilipino Articles

- [TO DO] Ako'y Pilipino Articles Definition
- [TO DO] Screenshot of Ako'y Pilipino Articles Site Page that will be web scraped


To web scrape fake news articles from Ako'y Pilipino, a function named `web_scrape_akoy_pilipino()` was created to scrape articles from the **Local** and **International** news tabs of the site.

`web_scrape_akoy_pilipino(url)`
> To use this function, pass the <u>URL of the page to be scraped</u> as the `url` parameter. To understand the **process of web scraping the page through BeautifulSoup**, kindly refer to the <u>step-by-step comments</u> inside the function.

In [None]:
def web_scrape_akoy_pilipino(url):
    # Send a GET request to the website
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the HTML elements that contain the fake news articles
    articles = soup.find_all("article")

    # Create empty lists to store the data
    titles = []
    times = []
    contents = []

    # Extract information from each article
    for article in articles:
        # Get the title of the article
        title = article.find('h2', class_='post-title entry-title')
        title = title.find('a', class_='').text.strip()
        titles.append(title)

        # Get the publication date of the article
        time = article.find('div', id='meta-post').text.strip()
        times.append(time)

        # Get the preview description of the article
        content = article.find('div', class_='entry').text.strip()
        contents.append(content)


    # Create a DataFrame to store the data
    data = pd.DataFrame({
        'Title': titles,
        'Date Posted': times,
        'Content': contents
    })

    # Print the data
    print(data)
    
    return data
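
The function above assumes every article contains a title, a date block, and a preview; if one is missing, `find()` returns `None` and the `.text` access raises an `AttributeError`. The sketch below shows one possible way to guard against that. The helper `text_or_none` and the sample markup are hypothetical, written only to mirror the selectors used above, and the second article deliberately has no date block.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, **attrs):
    """Return the stripped text of the first matching child tag, or None if absent."""
    tag = parent.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical markup loosely modeled on the selectors used above.
sample_html = """
<article>
  <h2 class="post-title entry-title"><a href="#">Sample title</a></h2>
  <div id="meta-post">January 1, 2023</div>
  <div class="entry">Sample preview text.</div>
</article>
<article>
  <h2 class="post-title entry-title"><a href="#">Title without a date</a></h2>
  <div class="entry">Another preview.</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Build one record per article; missing fields become None instead of raising.
rows = [
    {
        "Title": text_or_none(article, "h2", class_="post-title entry-title"),
        "Date Posted": text_or_none(article, "div", id="meta-post"),
        "Content": text_or_none(article, "div", class_="entry"),
    }
    for article in soup.find_all("article")
]
print(rows)
```

Building one record per article also keeps the columns aligned, whereas appending to separate lists can leave them with different lengths if one field is skipped.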

### Web Scraping `Local` News from Ako'y Pilipino

First, `local_url` stores the page link of the **Local News tab** of the Ako'y Pilipino site. It is then passed as the `url` argument of `web_scrape_akoy_pilipino()`. In this part of the collection, ***only local news articles*** are retrieved from the site.

In [None]:
# `local_url` : holds the URL of the Local News tab of Ako'y Pilipino
# `data_ap` : holds the web scraped local articles data

local_url = 'http://akoy-pilipino.blogspot.com/search/label/Local%20News'

data_ap = web_scrape_akoy_pilipino(local_url)

### Web Scraping `International` News from Ako'y Pilipino

Lastly, `int_url` stores the page link of the **International News tab** of the Ako'y Pilipino site. Passing `int_url` as the `url` argument of `web_scrape_akoy_pilipino()` retrieves ***only articles under international news*** from the site.

In [None]:
# `int_url` : holds the URL of the International News tab of Ako'y Pilipino
# `data_ap_2` : holds the web scraped international articles data

int_url = 'http://akoy-pilipino.blogspot.com/search/label/International%20News'

data_ap_2 = web_scrape_akoy_pilipino(int_url)

## Part 2.2. Web Scraping `Maharlika News` Articles


- [TO DO] Maharlika News Articles Definition
- [TO DO] Screenshot of Maharlika News Articles Site Page that will be web scraped


In this part of the collection, a function named `web_scrape_maharlika_news()` was created to extract fake news articles from Maharlika News. Like `web_scrape_akoy_pilipino()` from the previous section, it scrapes the articles listed on a single page of the site. Specifically, we scrape articles from the following categories:<br>
>**1.** World of Business<br>
>**2.** Technology<br>
>**3.** Philippine News<br>
>**4.** World News<br>


`web_scrape_maharlika_news(url)`
> To use this function, pass the <u>URL of the page to be scraped</u> as the `url` parameter. To understand the **process of web scraping the page through BeautifulSoup and through this function**, kindly read the <u>step-by-step comments</u> within the function.

In [None]:
def web_scrape_maharlika_news(url):
    # Send a GET request to the website
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the HTML elements that contain the fake news articles
    articles = soup.find_all("article")
    
    # Create empty lists to store the data
    titles = []
    times = []
    contents = []
    
    # Extract information from each article
    for article in articles:
        # Get the title of the article
        title = article.find('h2', class_='post-box-title').text.strip()
        titles.append(title)

        # Get the publication date of the article
        time = article.find('p', class_='post-meta').text.strip()
        times.append(time)

        # Get the preview description of the article
        content = article.find('div', class_='entry').text.strip()
        contents.append(content)

    # Create a DataFrame to store the data
    data = pd.DataFrame({
        'Title': titles,
        'Date Posted': times,
        'Content': contents
    })

    # Print the data
    print(data)
    
    return data

### Web Scraping `Business` News from Maharlika News

In this section, `bus_url` stores the page link of the **Business tab** of the Maharlika News site. It is then passed as the `url` argument of `web_scrape_maharlika_news()`, so ***only business news articles*** are retrieved from the site.

In [None]:
# `bus_url` : handles the URL of the Business tab of the Maharlika News
# `data_maharlika` : handles the web scraped business articles data

bus_url = 'https://www.maharlikanews.com/category/world-of-business/'

data_maharlika = web_scrape_maharlika_news(bus_url)

### Web Scraping `Technology` News from Maharlika News

This time, `tech_url` stores the page link of the **Technology tab** of the Maharlika News site and is passed to `web_scrape_maharlika_news()`. This part of the scraping therefore ***only includes technology news articles*** from the site.

In [None]:
# `tech_url` : handles the URL of the Technology tab of the Maharlika News
# `data_maharlika_2` : handles the web scraped technology articles data

tech_url = 'https://www.maharlikanews.com/category/technology/'

data_maharlika_2 = web_scrape_maharlika_news(tech_url)

### Web Scraping `Philippine News` from Maharlika News

Next, the page link of the **Philippine News tab** of the Maharlika News site is stored in `ph_url`. Passing `ph_url` to `web_scrape_maharlika_news()` means that ***only Philippine news articles*** are scraped in this part of the process.

In [None]:
# `ph_url` : holds the URL of the Philippine News tab of Maharlika News
# `data_maharlika_3` : holds the web scraped Philippine news articles data

ph_url = 'https://www.maharlikanews.com/category/in-the-news/philippine-news/'

data_maharlika_3 = web_scrape_maharlika_news(ph_url)

### Web Scraping `World News` from Maharlika News

Finally, `world_url` stores the page link of the **World News tab** of the Maharlika News site. Running `web_scrape_maharlika_news()` with `world_url` as its `url` argument retrieves ***only world news articles*** from the site.

In [None]:
# `world_url` : holds the URL of the World News tab of Maharlika News
# `data_maharlika_4` : holds the web scraped world news articles data

world_url = 'https://www.maharlikanews.com/category/in-the-news/the-world-news/'

data_maharlika_4 = web_scrape_maharlika_news(world_url)
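
The four Maharlika News calls above share the same pattern, so they could also be driven from a single mapping of category names to URLs. The sketch below is one possible way to do that, not the approach used in this notebook. The function `scrape_categories`, the `Category` column, and the stand-in `fake_scraper` are all hypothetical; the stand-in exists only so the example runs without network access.

```python
import pandas as pd

def scrape_categories(scraper, category_urls):
    """Run `scraper` on each URL and label the resulting rows with their category.

    `scraper` is any function that takes a URL and returns a DataFrame,
    such as `web_scrape_maharlika_news`.
    """
    frames = []
    for category, url in category_urls.items():
        frame = scraper(url).copy()
        frame["Category"] = category  # hypothetical extra column, not in the original data
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Stand-in scraper (assumption) so the sketch runs offline.
def fake_scraper(url):
    return pd.DataFrame({
        "Title": [f"Article from {url}"],
        "Date Posted": ["n/a"],
        "Content": ["..."],
    })

demo = scrape_categories(
    fake_scraper,
    {"Business": "bus_url", "Technology": "tech_url"},
)
print(demo)
```

With the real scraper, the call would look like `scrape_categories(web_scrape_maharlika_news, {"Business": bus_url, "Technology": tech_url})`, and the added column would record which tab each article came from.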

# Combining Fake News Data 

Now that we have collected the fake news data from **Ako'y Pilipino** and **Maharlika News**, we will <u>combine the data into one dataframe</u>.

To combine the fake news data from both sites, we use pandas' [`concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and then pandas' [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to renumber the index of the combined data.

In [None]:
fakenews_df = pd.concat([data_ap, data_maharlika, data_ap_2, data_maharlika_2, data_maharlika_3, data_maharlika_4]).reset_index(drop=True)
fakenews_df
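
To see why `reset_index(drop=True)` is needed after `concat()`, the small sketch below uses two toy DataFrames (made up for illustration) and compares the index with and without the reset.

```python
import pandas as pd

# Toy frames standing in for two scraped datasets.
a = pd.DataFrame({"Title": ["A1", "A2"]})
b = pd.DataFrame({"Title": ["B1"]})

kept = pd.concat([a, b])                               # each frame keeps its own row labels
renumbered = pd.concat([a, b]).reset_index(drop=True)  # rows renumbered from 0

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

Without the reset, row labels repeat across the source frames, which makes label-based lookups such as `.loc[0]` return more than one row.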

# Saving the Combined Fake News Article Data to CSV
With the help of pandas' [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html), we export the web scraped fake news articles stored in `fakenews_df` to a CSV file named **`fakenews_dataframe.csv`**.

In [None]:
fakenews_df.to_csv("fakenews_dataframe.csv", index=False)
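
As a quick sanity check, an exported file can be read back with `read_csv()` to confirm that the columns and row count survived the round trip. The sketch below does this with a toy DataFrame and a temporary file, both assumptions made so the example does not touch `fakenews_dataframe.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy data with the same columns as the scraped DataFrames.
sample = pd.DataFrame({
    "Title": ["T1", "T2"],
    "Date Posted": ["d1", "d2"],
    "Content": ["c1", "c2"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(path, index=False)  # index=False avoids writing an extra index column
    restored = pd.read_csv(path)

print(list(restored.columns))
print(restored.shape)
```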