# **Data Collection Part 2**: Web Scraping <u>Fake News</u>

This notebook demonstrates the second part of this project's data collection: web scraping fake news from news sites in the Philippines.

The **news sites involved** in collecting fake news are the following:
1. [`Ako'y Pilipino`](http://akoy-pilipino.blogspot.com/)
2. [`Maharlika News`](https://www.maharlikanews.com/)

# Importing Libraries
To start, we import the libraries that help us perform web scraping and data processing.

## Basic Libraries
*   [`requests`](https://pypi.org/project/requests/): has functions for HTTP requests
> allows us to request a web page and access its contents

*   [`pandas`](https://pandas.pydata.org/): has functions for data analysis and manipulation
> allows us to organize all scraped data in a format that can be used for Exploratory Data Analysis (EDA) and for building a fake news detection model

In [2]:
import requests
import pandas as pd

## Web Scraping Library: [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

> The [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library will be used for web scraping fake news from Ako'y Pilipino and Maharlika News, since we will only be scraping one page at a time on both sites.

In [3]:
from bs4 import BeautifulSoup
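
As a minimal sketch of what `BeautifulSoup` does, the snippet below parses a small, made-up HTML string and pulls out the heading text of each `<article>`. The markup and the names `sample_html` and `sample_titles` are assumptions for illustration only; they are not taken from either site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
sample_html = """
<article><h2 class="post-title"><a href="#">First headline</a></h2></article>
<article><h2 class="post-title"><a href="#">Second headline</a></h2></article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; get_text(strip=True) trims whitespace
sample_titles = [
    article.find("h2").get_text(strip=True)
    for article in soup.find_all("article")
]
print(sample_titles)
```

The scrapers in this notebook follow the same pattern, only with HTML fetched through `requests` instead of an inline string.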

# Web Scraping Fake News Sites

## Web Scraping Ako'y Pilipino Articles

In [71]:
# This web scraper uses BeautifulSoup to collect the title, date, and content
# of fake news articles listed on "akoy-pilipino.blogspot.com".

def web_scrape_akoy_pilipino(url):
    # Send a GET request to the website and stop early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the HTML elements that contain the fake news articles
    articles = soup.find_all("article")

    # Create empty lists to store the data
    titles = []
    dates = []
    contents = []

    # Extract information from each article
    for article in articles:
        # Get the title of the article from the link inside its heading
        heading = article.find('h2', class_='post-title entry-title')
        titles.append(heading.find('a').text.strip())

        # Get the publication date of the article
        # (named date_posted to avoid shadowing the standard `time` module)
        date_posted = article.find('div', id='meta-post').text.strip()
        dates.append(date_posted)

        # Get the preview description of the article
        content = article.find('div', class_='entry').text.strip()
        contents.append(content)

    # Create a DataFrame to store the data
    data = pd.DataFrame({
        'Title': titles,
        'Date Posted': dates,
        'Content': contents
    })

    # Print the data
    print(data)

    return data
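
`find` returns `None` when no tag matches, so chaining `.text` onto it raises an `AttributeError` if an article lacks one of the expected elements. The sketch below shows one possible way to guard against that. The helper `safe_text` and the sample markup are hypothetical and are not part of the scraper above.

```python
from bs4 import BeautifulSoup

def safe_text(parent, name, **attrs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical article that has a title but no date block
sample = BeautifulSoup(
    '<article><h2 class="post-title entry-title"><a>Headline</a></h2></article>',
    "html.parser",
)
article = sample.find("article")

title_text = safe_text(article, "h2", class_="post-title entry-title")
date_text = safe_text(article, "div", id="meta-post")  # missing in the sample

print(title_text, date_text)
```

With a helper like this, a missing field becomes a `None` entry in the DataFrame instead of stopping the whole scrape.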

In [None]:
# URL of the website you want to scrape
url = 'http://akoy-pilipino.blogspot.com/search/label/Local%20News'

data_ap = web_scrape_akoy_pilipino(url)

In [None]:
# URL of the website you want to scrape
url = 'http://akoy-pilipino.blogspot.com/search/label/International%20News'

data_ap_2 = web_scrape_akoy_pilipino(url)

## Web Scraping Maharlika News Articles

In [66]:
# This web scraper uses BeautifulSoup to collect the title, date, and content
# of fake news articles listed on "https://www.maharlikanews.com/".
def web_scrape_maharlika_news(url):
    # Send a GET request to the website and stop early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the HTML elements that contain the fake news articles
    articles = soup.find_all("article")

    # Create empty lists to store the data
    titles = []
    dates = []
    contents = []

    # Extract information from each article
    for article in articles:
        # Get the title of the article
        title = article.find('h2', class_='post-box-title').text.strip()
        titles.append(title)

        # Get the publication date of the article
        # (named date_posted to avoid shadowing the standard `time` module)
        date_posted = article.find('p', class_='post-meta').text.strip()
        dates.append(date_posted)

        # Get the preview description of the article
        content = article.find('div', class_='entry').text.strip()
        contents.append(content)

    # Create a DataFrame to store the data
    data = pd.DataFrame({
        'Title': titles,
        'Date Posted': dates,
        'Content': contents
    })

    # Print the data
    print(data)

    return data

In [67]:
# URL of the website you want to scrape
url = 'https://www.maharlikanews.com/category/world-of-business/'

data_maharlika = web_scrape_maharlika_news(url)

                                               Title  \
0  Striking Metro grocery store worker says she ‘...   
1  Unemployment inches up to 5.5% in July with sp...   
2  Telus slashing 6,000 jobs amid drop in 2nd qua...   
3  Your latest questions about Bill C-18 and the ...   
4  Stock markets slump as rating agency Fitch dow...   
5  National Bank buys Silicon Valley Bank’s Canad...   
6  Tupperware warned it might go bust — but its s...   
7  Meta permanently ending news availability on i...   
8  ‘Barbenheimer’ made this July the best one eve...   
9  These Canadian companies switched to a 4-day w...   

                  Date Posted  \
0  2 days ago\nBusiness World   
1  2 days ago\nBusiness World   
2  2 days ago\nBusiness World   
3  3 days ago\nBusiness World   
4  4 days ago\nBusiness World   
5  4 days ago\nBusiness World   
6  5 days ago\nBusiness World   
7  5 days ago\nBusiness World   
8  5 days ago\nBusiness World   
9  7 days ago\nBusiness World   

                   

In [68]:
# URL of the website you want to scrape
url = 'https://www.maharlikanews.com/category/technology/'

data_maharlika_2 = web_scrape_maharlika_news(url)

                                               Title  \
0  I Looked Into Sam Altman’s Orb and All I Got W...   
1   Meta Just Proved People Hate Chronological Feeds   
2  Meta’s Election Research Opens More Questions ...   
3  There’s no flashy Oppenheimer cameo, but this ...   
4  Century-old turtle missing for 1 year found de...   
5  More Battlefield AI Will Make the Fog of War M...   
6  Big AI Won’t Stop Election Deepfakes With Wate...   
7  Wildfire smoke in your eyes? Doctors say we ne...   
8  What you won’t learn about in Oppenheimer: the...   
9  Meta’s Open Source Llama Upsets the AI Horse Race   

                Date Posted                                            Content  
0  14 hours ago\nTechnology  Joel Khalili Business Jul 28, 2023 6:00 AM I L...  
1  14 hours ago\nTechnology  Paresh Dave Business Jul 27, 2023 2:00 PM Meta...  
2  14 hours ago\nTechnology  Vittoria ElliottParesh Dave Business Jul 27, 2...  
3    2 days ago\nTechnology  Of all the scientists involved

In [69]:
# URL of the website you want to scrape
url = 'https://www.maharlikanews.com/category/in-the-news/philippine-news/'

data_maharlika_3 = web_scrape_maharlika_news(url)

                                               Title  \
0  US denounces Chinese Coast Guard’s ‘excessive’...   
1  PH health expert among candidates for WHO West...   
2  Romualdez marks ‘impressive’ first year as Spe...   
3  1 dead, 95 saved from boat mishap off Romblon ...   
4  Teodoro: Natl security concern in EDCA sites, ...   
5                    Inflation slows further to 4.7%   
6   Manila probes incident of toppled electric posts   
7              President taps Corpus, Hosaka for GCG   
8  Marcos asks actors’ group to lift TV, film sta...   
9                    ‘Look kindly’ on PH, China told   

                              Date Posted  \
0  8 hours ago\nHeadline, Philippine News   
1    1 day ago\nHeadline, Philippine News   
2             2 days ago\nPhilippine News   
3             2 days ago\nPhilippine News   
4   2 days ago\nHeadline, Philippine News   
5             3 days ago\nPhilippine News   
6   3 days ago\nHeadline, Philippine News   
7             3 days ag

In [70]:
# URL of the website you want to scrape
url = 'https://www.maharlikanews.com/category/in-the-news/the-world-news/'

data_maharlika_4 = web_scrape_maharlika_news(url)

                                               Title            Date Posted  \
0  Rescue, evacuation efforts intensify as deadly...   1 day ago\nWorldNews   
1  This baby was found under the rubble of a dead...   1 day ago\nWorldNews   
2  Pakistan ex-PM Imran Khan arrested after court...  2 days ago\nWorldNews   
3  The scandals swirling around Hunter Biden — wh...  2 days ago\nWorldNews   
4  NYC police to charge Twitch streamer after fan...  2 days ago\nWorldNews   
5  Canada says Armenians face ‘deteriorating huma...  2 days ago\nWorldNews   
6  Russian warship seriously damaged in Ukrainian...  2 days ago\nWorldNews   
7  Greenpeace protesters arrested after stunt at ...  4 days ago\nWorldNews   
8  Jury recommends death penalty for shooter who ...  4 days ago\nWorldNews   
9  Rudy Giuliani defiant as Trump co-conspirators...  4 days ago\nWorldNews   

                                             Content  
0  Thousands of people threatened by storm-swolle...  
1  Baby Afraa surviv
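
In the output printed above, each `Date Posted` value combines a relative timestamp and one or more categories, separated by a newline (for example `8 hours ago\nHeadline, Philippine News`). The sketch below shows one way these could be separated before analysis. The function `split_post_meta` is a hypothetical helper and is not part of the original scraper.

```python
def split_post_meta(meta):
    """Split a post-meta string into its relative time and a list of categories."""
    # partition splits at the first newline only; cats is "" when there is none
    posted, _, cats = meta.partition("\n")
    categories = [c.strip() for c in cats.split(",") if c.strip()]
    return posted.strip(), categories

posted, categories = split_post_meta("8 hours ago\nHeadline, Philippine News")
print(posted)
print(categories)
```

The same function could be applied to the whole column with `data["Date Posted"].apply(split_post_meta)`. Because the timestamps are relative ("2 days ago"), they only stay meaningful if the scrape date is recorded alongside them.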

# Combining Data From Both Sites

In [None]:
# Combine every scraped DataFrame into one and renumber the rows
fakenews_df = pd.concat(
    [data_ap, data_ap_2, data_maharlika, data_maharlika_2, data_maharlika_3, data_maharlika_4],
    ignore_index=True,
)
fakenews_df
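
The combining step can be sketched on small stand-in frames. The example below assumes frames with the same three columns the scrapers return; the `Source` column and the values in it are hypothetical additions that keep each row traceable to its site after merging.

```python
import pandas as pd

# Tiny stand-in frames with the same columns the scrapers return
frame_a = pd.DataFrame({
    "Title": ["A1", "A2"],
    "Date Posted": ["d1", "d2"],
    "Content": ["c1", "c2"],
})
frame_b = pd.DataFrame({
    "Title": ["B1"],
    "Date Posted": ["d3"],
    "Content": ["c3"],
})

# assign() adds the hypothetical 'Source' column; ignore_index renumbers rows from 0
combined = pd.concat(
    [frame_a.assign(Source="site_a"), frame_b.assign(Source="site_b")],
    ignore_index=True,
)
print(combined)
```

Passing `ignore_index=True` has the same effect on the row labels as calling `.reset_index(drop=True)` on the concatenated result.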

# Saving the Combined Fake News Article Data to CSV
Using pandas' [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html), we export the web scraped articles stored in `fakenews_df` to a CSV file named **`fakenews_dataframe.csv`**.

In [None]:
fakenews_df.to_csv("fakenews_dataframe.csv", index=False)

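
Scraped text often contains commas and line breaks, so it can be worth checking that the exported file reads back intact. The sketch below does a round trip through an in-memory buffer instead of a file on disk. The frame `sample_df` and its contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical row with a comma in the title and a line break in the content
sample_df = pd.DataFrame({
    "Title": ["Headline, with comma"],
    "Content": ["Line one\nline two"],
})

# Write to an in-memory text buffer; index=False omits the row labels
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

`to_csv` quotes fields that contain the delimiter or a newline, which is why both values survive the round trip.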