<h1>Extracting Real-Time Headlines from News Channels</h1>

<p>This code extracts headlines from the top 100 news websites to build an up-to-date list of headlines related to the coronavirus outbreak. A step-by-step explanation of the code follows:</p>

<h2>Step 1: Install and Import Libraries</h2>

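Each imported library has a specific role: Beautiful Soup parses HTML, urllib fetches pages, Django's validators check URLs, time enforces the runtime limit, and boto3 writes to Amazon S3.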

In [None]:
!pip install bs4
!pip install django
!pip install boto3

from bs4 import BeautifulSoup as bs
import urllib.request
from django.core.validators import URLValidator
from django.core.exceptions import ValidationError
import time
from datetime import datetime
import boto3

<h2>Step 2: Initialize and Empty the S3 Bucket</h2>

Using the boto3 library and the AWS account credentials, the code connects to S3 and overwrites the object headline.txt in the headlines bucket with an empty string, clearing any previous results.

In [None]:
s3 = boto3.resource(
    's3',
    region_name='us-east-1',
    aws_access_key_id=*HIDDEN*,
    aws_secret_access_key=*HIDDEN*
)

content = ""
s3.Object('headlines', 'headline.txt').put(Body=content)
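The clearing step can be wrapped in a small helper. The sketch below is illustrative only: `reset_object` and `FakeS3` are hypothetical names that do not appear in the notebook, and the in-memory stand-in exists so the example runs without AWS credentials. Passing a real `boto3.resource('s3')` object instead would issue the same `Object(...).put(Body=...)` call used above.

```python
def reset_object(s3_resource, bucket, key):
    """Overwrite bucket/key with an empty body (hypothetical helper)."""
    return s3_resource.Object(bucket, key).put(Body="")


class FakeS3:
    """In-memory stand-in for a boto3 S3 resource (assumption, demo only)."""

    def __init__(self):
        self.store = {}

    def Object(self, bucket, key):
        store = self.store

        class _Obj:
            def put(self, Body):
                # Mimic S3: writing to an existing key replaces its content
                store[(bucket, key)] = Body
                return {"stored_chars": len(Body)}

        return _Obj()


fake = FakeS3()
fake.store[("headlines", "headline.txt")] = "stale headline"
reset_object(fake, "headlines", "headline.txt")
print(repr(fake.store[("headlines", "headline.txt")]))
```

Because a put to an existing key replaces the object, writing an empty string is enough to clear the file; no separate delete is needed.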

<h2>Step 3: Create a Function that Extracts Websites</h2>

<p>First, using the urllib library, the code opens the URL and reads the page's HTML. Next, the Beautiful Soup library parses the HTML and extracts its hyperlinks. Django's URLValidator then validates each hyperlink; links that fail validation or that contain the exception URL are discarded, and the rest are added to an array. This array of URLs is then returned. If the page cannot be fetched or parsed, an empty array is returned.</p>

In [4]:
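def extractWebsites(url, exceptionURL):
    """Return the valid hyperlinks on url that do not contain exceptionURL."""
    try:
        webUrl = urllib.request.urlopen(url)
        data = webUrl.read()
        # Name the parser explicitly to avoid bs4's "no parser specified" warning
        soup = bs(data, 'html.parser')
        validate = URLValidator()  # one validator serves every link
        arr = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is not None:
                try:
                    validate(href)
                    if exceptionURL not in href:
                        arr.append(href)
                except ValidationError:
                    # Skip relative or malformed links
                    continue
        return arr
    except Exception:
        # Any fetch or parse failure yields an empty result
        return []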
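To see the filtering logic without network access or third-party packages, here is a minimal sketch that swaps in the standard library: `html.parser` in place of Beautiful Soup and a rough `urllib.parse` check in place of Django's `URLValidator` (which is considerably stricter). The names `LinkCollector`, `is_absolute_http_url`, and `filter_links` are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def is_absolute_http_url(href):
    """Rough validity check; an assumption, far looser than URLValidator."""
    parts = urlparse(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filter_links(html, exception_url):
    """Keep absolute links that do not contain exception_url."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs
            if is_absolute_http_url(h) and exception_url not in h]


sample = """
<a href="https://example.com/covid-update">Update</a>
<a href="/relative/path">Relative</a>
<a href="https://example.com/interactive/map">Map</a>
<a name="no-href">Anchor</a>
"""
links = filter_links(sample, "interactive")
print(links)  # ['https://example.com/covid-update']
```

The relative link fails the validity check, the interactive link is excluded by the exception string, and the anchor without an href is ignored, which mirrors the three filters in extractWebsites.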

<h2>Step 4: Create a Function that Searches a URL for Specific Key Words</h2>

The function iterates through the array of key words and returns True as soon as one of them appears in the given string (a URL or a page title); otherwise it returns False.

In [5]:
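def search_array(url, arrOfPoss):
    """Return True if any key word in arrOfPoss occurs in url, ignoring case."""
    lowered = url.lower()
    for keyword in arrOfPoss:
        # Lowercase both sides so 'COVID', 'Covid' and 'covid' all match
        if keyword.lower() in lowered:
            return True
    return False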
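One subtlety in keyword matching is letter case. The sketch below uses a hypothetical helper, `matches_any`, to show why testing only the capitalized and lowercase spellings of a key word can miss an all-caps occurrence such as "COVID", and how lowercasing both sides avoids that. The URL is made up for illustration.

```python
def matches_any(text, keywords):
    """Return True if any keyword occurs in text, ignoring case (hypothetical helper)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


url = "https://example.com/news/COVID-19-vaccine"

# str.capitalize() lowercases everything after the first letter,
# so neither derived spelling equals the all-caps form in the URL.
print("COVID".capitalize(), "COVID".lower())   # Covid covid
print(url.find("Covid"), url.find("covid"))    # -1 -1

# Lowercasing both sides matches however the URL is cased.
print(matches_any(url, ["COVID"]))             # True
print(matches_any(url, ["influenza"]))         # False
```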

<h2>Step 5: Extract the Headlines</h2>

First, an array of news channels is selected from Feedspot's top 100 news websites. The array is then iterated through with the extractWebsites function to collect the links on each site; sites whose URL contains "npr" are skipped. "interactive" is chosen as the exceptionURL because interactive pages do not provide relevant headlines. The remaining links are filtered using the search_array function with a set of COVID-19 key words, and the matching URLs are stored in a new array called covid_urls.

Using the extractWebsites function again, covid_urls is iterated through to find more COVID-related links. To reduce runtime, a counter stops this loop once the number of pages processed exceeds a fixed limit. Additionally, a ten-minute deadline is set for each page in covid_urls, and the inner for loops terminate once that deadline has passed. Once again, "interactive" is chosen as the exceptionURL, and the links are filtered using the search_array function with the same set of COVID-19 key words. These filtered URLs are stored in a new array called new_covid_urls.
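The counter and the ten-minute deadline can be combined into one reusable pattern. This is a minimal sketch rather than the notebook's implementation: `bounded_iter` and its injectable `clock` parameter are hypothetical, and the fake clock exists only so the example behaves the same on every run.

```python
import time


def bounded_iter(items, max_items, time_limit_s, clock=time.time):
    """Yield items until max_items have been yielded or the deadline passes."""
    deadline = clock() + time_limit_s
    for count, item in enumerate(items):
        if count >= max_items or clock() > deadline:
            break
        yield item


# Fake clock that advances 200 seconds on every call (demo assumption).
ticks = iter(range(0, 10_000, 200))


def fake_clock():
    return next(ticks)


urls = [f"https://example.com/page{i}" for i in range(20)]

# Deadline is 0 + 600; checks happen at t=200, 400, 600, then 800 stops the loop.
visited = list(bounded_iter(urls, max_items=14, time_limit_s=600, clock=fake_clock))
print(len(visited))  # 3

# With a clock that never advances, only the item cap applies.
capped = list(bounded_iter(urls, max_items=14, time_limit_s=600, clock=lambda: 0))
print(len(capped))   # 14
```

Note that the deadline is only checked between items, so a single slow page fetch can still overrun the limit.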

This process is repeated one more time to increase the number of relevant headlines extracted. The filtered URLs are stored in another array called newest_covid_urls. To store the relevant headlines in a text file called "headline.txt", the URLs inside new_covid_urls and newest_covid_urls are iterated through. For each URL, the page is fetched and parsed to read its title. If the title is not already in the headlines array and contains a key word, it is appended to the headlines array, and the updated array is written to the S3 object. After the loops finish, the final headlines array is written to the S3 object once more.
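The title-extraction and de-duplication step can be tried on inline HTML. The sketch below is an assumption-laden illustration: it uses the standard library's `html.parser` instead of Beautiful Soup, the names `TitleParser` and `record_headline` are hypothetical, and it returns the newline-joined text instead of writing it to S3.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Capture the text of the first <title> element (stdlib stand-in)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def record_headline(html, headlines, keywords):
    """Append the page title if it is new and matches a keyword; return the file text."""
    parser = TitleParser()
    parser.feed(html)
    title = (parser.title or "").strip()
    is_relevant = any(k.lower() in title.lower() for k in keywords)
    if title and title not in headlines and is_relevant:
        headlines.append(title)
    return "\n".join(headlines)  # text that would be stored in headline.txt


headlines = []
pages = [
    "<html><head><title>Coronavirus cases rise</title></head></html>",
    "<html><head><title>Coronavirus cases rise</title></head></html>",  # duplicate
    "<html><head><title>Local sports roundup</title></head></html>",    # off-topic
    "<html><head><title>New pandemic guidance</title></head></html>",
]
for page in pages:
    body = record_headline(page, headlines, ["coronavirus", "pandemic"])
print(body)
```

Only the first and last titles survive: the duplicate is rejected by the membership check and the sports page by the keyword filter.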

In [None]:
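# data is assumed to hold the list of news-site URLs defined in an earlier cell
headlines = []  # titles collected so far
for index in data: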
    if (index.find('npr')!=-1 or index.find('NPR')!=-1):
        continue
    new_data = extractWebsites(index, 'interactive')
    covid_urls = []
    arr_keywords = ['coronavirus', 'COVID', 'covid', 'pandemic', 'epidemic', 'disease', 'SARS', 'sars', 'virus']
    for index1 in new_data:
        if (search_array(index1, arr_keywords)):
            covid_urls.append(index1)
    count = 0
    for val in covid_urls:
        if count > 13:
            break
        count += 1  # without this increment the limit above is never reached
        future = time.time() + 600
        newer_data = extractWebsites(val, "interactive")
        new_covid_urls = []
        for index1 in newer_data:
            if (search_array(index1, arr_keywords)):
                new_covid_urls.append(index1)
        for value in new_covid_urls:
            if (time.time()) > future:
                break
            newest_data = extractWebsites(value, "interactive")
            newest_covid_urls = []
            for index2 in newest_data:
                if (time.time()) > future:
                    break
                if (search_array(index2, arr_keywords)):
                    newest_covid_urls.append(index2)
            for urls1 in newest_covid_urls:
                if ((time.time()) > future):
                    break
                try:
                    webUrl = urllib.request.urlopen(urls1)
                    # Use a separate name so data, the list of news sites, is not rebound
                    page = webUrl.read()
                    soup = bs(page, 'html.parser')
                    try:
                        title = soup.find('title').string
                        if (title not in headlines) and (search_array(title, arr_keywords)):
                            headlines.append(title)
                            content = "\n".join(headlines)
                            s3.Object('headlines', 'headline.txt').put(Body=content)
                    except (AttributeError, TypeError):
                        # The page has no usable <title>
                        print("Exception occurred")
                        continue
                except Exception:
                    continue
        for urls in new_covid_urls:
            if (time.time()) > future:
                break
            try:
                webUrl = urllib.request.urlopen(urls)
                # Use a separate name so data, the list of news sites, is not rebound
                page = webUrl.read()
                soup = bs(page, 'html.parser')
                try:
                    title = soup.find('title').string
                    if (title not in headlines) and (search_array(title, arr_keywords)):
                        headlines.append(title)
                        content = "\n".join(headlines)
                        s3.Object('headlines', 'headline.txt').put(Body=content)
                except Exception:
                    # The page has no usable <title>
                    continue
            except Exception:
                continue
                
content = "\n".join(headlines)
s3.Object('headlines', 'headline.txt').put(Body=content)
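The three nested passes above follow the same fetch-then-filter pattern at each level, so they can also be expressed as a depth-limited loop. The following is a sketch under stated assumptions, not a drop-in replacement: `crawl` and `get_links` are hypothetical names, the `seen` set that skips already-visited links is an addition the notebook does not have, and a small dictionary stands in for real pages so the example runs offline.

```python
def crawl(seeds, get_links, keywords, max_depth=3):
    """Follow keyword-matching links level by level; return the links found per level."""
    def relevant(url):
        return any(k.lower() in url.lower() for k in keywords)

    seen = set()
    levels = []
    frontier = list(seeds)
    for _ in range(max_depth):
        next_level = []
        for url in frontier:
            for link in get_links(url):
                if relevant(link) and link not in seen:
                    seen.add(link)
                    next_level.append(link)
        levels.append(next_level)
        frontier = next_level  # the next pass starts from the links just found
    return levels


# Hypothetical link graph standing in for real pages.
graph = {
    "https://news.example": ["https://news.example/covid",
                             "https://news.example/sports"],
    "https://news.example/covid": ["https://news.example/covid/vaccine",
                                   "https://news.example/covid"],
    "https://news.example/covid/vaccine": ["https://news.example/virus-faq"],
}
levels = crawl(["https://news.example"],
               lambda u: graph.get(u, []),
               ["covid", "virus"])
for depth, found in enumerate(levels, start=1):
    print(depth, found)
```

In this toy graph the three levels correspond to covid_urls, new_covid_urls, and newest_covid_urls; in practice `get_links` could be a call such as `extractWebsites(url, 'interactive')`.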