# GNews Scraper
This notebook scrapes news articles from Google News for specific keywords and extracts relevant information. The extracted data is saved into CSV files for further analysis.

## Introduction
In this notebook, we will create a scraper to extract news articles from Google News. The process involves:
1. Retrieving news articles for specific keywords.
2. Parsing the news articles to extract relevant information.
3. Saving the extracted data into CSV files.
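
Steps 2 and 3 above can be sketched on already-fetched entries as follows. This is a minimal illustration, not the notebook's own implementation: the helper names `entries_to_rows` and `rows_to_csv`, the plain-dict entry layout, and the sample entry are all assumptions made for the example, and only the standard library is used so the sketch runs without network access.

```python
import csv
import io
import re

# Anchor-text pattern of the kind used to pull headline text out of an
# entry's HTML description.
ANCHOR_TEXT = re.compile(r'<a[^>]*>(.*?)</a>', re.IGNORECASE)

def entries_to_rows(entries):
    """Turn feed entries (plain dicts here, an assumption) into flat rows."""
    rows = []
    for entry in entries:
        headlines = ANCHOR_TEXT.findall(entry.get("description", ""))
        rows.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "description": " | ".join(headlines),
            "published": entry.get("published", ""),
        })
    return rows

def rows_to_csv(rows, handle):
    """Write rows to any text file handle as CSV with a header line."""
    fieldnames = ["title", "link", "description", "published"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical sample entry, shaped like the fields a feed entry exposes.
sample_entries = [{
    "title": "Example headline - Example Source",
    "link": "https://example.com/article",
    "description": '<a href="https://example.com/article">Example headline</a>',
    "published": "Thu, 07 Mar 2024 08:00:00 GMT",
}]

buffer = io.StringIO()
rows_to_csv(entries_to_rows(sample_entries), buffer)
csv_text = buffer.getvalue()
```

Writing to a file instead of a buffer only requires passing a handle opened with `open(path, "w", newline="")`.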

## Import Libraries
We start by importing the necessary libraries. We will use:
- `feedparser` for fetching and parsing the Google News RSS feed.
- `re` for extracting link text from the HTML in each entry's description.
- `pandas` for handling data in DataFrame format.
- `os` for listing files and joining file paths.
- `OpenAI` from the `openai` package for calling the chat completions API.
- `random` for sampling messages.

In [None]:
# Import necessary libraries
import feedparser
import re
import pandas as pd
import os
from openai import OpenAI
import random

## Define Scraping Functions
Next, we define the class used to scrape Google News. `googleNewsFeedScraper` takes a search query and an optional time window (a value plus a unit such as `hours`, `days`, `months`, or `years`), and exposes one method:
- `scrape_google_news_feed`: Builds the RSS search URL, fetches the feed, and prints the title, link, description, publication date, and source of each entry.

In [None]:
class googleNewsFeedScraper:
    def __init__(self, query, value: int, duration):
        self.query = query
        self.value = value
        if duration == "hours" or duration == "hour":
            self.duration = 'h'
        elif duration == "days" or duration == "day":
            self.duration = 'd'
        elif duration == "months" or duration == "month":
            self.duration = 'm'
        elif duration == "years" or duration == "year":
            self.duration = 'y'
        else:
            self.duration = None
        

    def scrape_google_news_feed(self):
        if self.value is None or self.duration is None:
            # No time window given: omit the `when:` operator entirely
            rss_url = f'https://news.google.com/rss/search?q={self.query}&hl=en-US&gl=US&ceid=US:en'
        else:
            rss_url = f'https://news.google.com/rss/search?q={self.query}%20when%3A{self.value}{self.duration}&hl=en-US&gl=US&ceid=US:en'
        feed = feedparser.parse(rss_url)

        if feed.entries:
            # Matches the text inside each <a> tag of an entry's HTML description
            pattern = re.compile(r'<a[^>]*>(.*?)</a>', re.IGNORECASE)
            for entry in feed.entries:
                title = entry.title
                link = entry.link
                description = entry.description
                pubdate = entry.published
                source = entry.source
                filtered_texts = pattern.findall(description)
                print(f"Title: {title}\nLink: {link}\nDescription: {filtered_texts}\nPublished: {pubdate}\nSource: {source}")
                print("-+-")
        else:
            print("Nothing Found!")
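
The class above interpolates the query into the URL as-is, so a query containing spaces or special characters may produce a malformed URL. One possible way to handle this, sketched here under assumptions, is to let `urllib.parse.urlencode` build the query string. The helper name `build_rss_url` and the `DURATION_CODES` table are hypothetical and not part of the notebook; the one-letter suffixes simply mirror the ones the class assigns.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Duration words mapped to the one-letter suffixes used by the scraper class.
DURATION_CODES = {
    "hour": "h", "hours": "h",
    "day": "d", "days": "d",
    "month": "m", "months": "m",
    "year": "y", "years": "y",
}

def build_rss_url(query, value=None, duration=None):
    """Hypothetical helper: build a Google News RSS search URL with an encoded query."""
    code = DURATION_CODES.get(duration)
    search = query
    # Only append the `when:` operator if both parts of the time window are usable.
    if value is not None and code is not None:
        search = f"{query} when:{value}{code}"
    params = {"q": search, "hl": "en-US", "gl": "US", "ceid": "US:en"}
    return "https://news.google.com/rss/search?" + urlencode(params)

url = build_rss_url("tornado warning", 3, "years")
# Decoding the query string recovers the original search text.
decoded_query = parse_qs(urlsplit(url).query)["q"][0]
```

The resulting URL could then be passed to `feedparser.parse` in place of the hand-built f-string.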

## Example Usage
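Finally, we demonstrate the scraper by fetching news articles for a specific keyword. The cells in this section also load SEC filing risk-factor sections from the CSV files in the `sec` folder and ask an OpenAI model which real-world events each section mentions.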


In [None]:
main_df = pd.DataFrame()
folder_path = "sec"
csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]

for file in csv_files:
    df = pd.read_csv(os.path.join(folder_path, file))
    if not df.empty:
        main_df = pd.concat([main_df, df], ignore_index=True)

### NOTE: You must use your own secret here for the OpenAI API

In [None]:
model = "gpt-3.5-turbo-0125"
key = os.environ.get("OPENAI_API_KEY")  # set this environment variable to your own secret key

In [None]:
messages = []
for index, row in main_df.iterrows():
    text = row['Risk Factors Text']
    date = row['Fill Date']
    company_name = row['Company Name']
    prefix = f"Here is an SEC Filing Risk Section for {company_name} filed on date {date} : "
    suffix = "Can you please tell if there are any specific news and real-world events mentioned in this? They need to be real world events mentioned."
    temp = prefix + "\n\n\n" + str(text) + "\n\n\n" + suffix
    messages.append([date, company_name, temp])

In [None]:
def filter_year(messages, year):
    ret = [msg for msg in messages if msg[0][:4] == str(year)]
    return ret
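
If several years are needed at once, grouping the messages in a single pass avoids calling `filter_year` once per year. The sketch below is an illustration only: `group_by_year` is a hypothetical helper, the sample entries are made up, and it assumes every date is a string starting with a four-digit year, as in the `[date, company_name, prompt]` entries built above.

```python
from collections import defaultdict

def group_by_year(messages):
    """Hypothetical helper: map each filing year to its [date, company, prompt] entries."""
    grouped = defaultdict(list)
    for msg in messages:
        year = int(str(msg[0])[:4])  # assumes dates look like 'YYYY-MM-DD'
        grouped[year].append(msg)
    return dict(grouped)

# Made-up entries in the same [date, company_name, prompt] layout as `messages`.
sample_messages = [
    ["2010-02-19", "COMPANY A", "prompt text A"],
    ["2013-03-01", "COMPANY B", "prompt text B"],
    ["2010-11-30", "COMPANY C", "prompt text C"],
]
by_year = group_by_year(sample_messages)
```

Looking up `by_year.get(2010, [])` then gives the same entries that `filter_year(sample_messages, 2010)` would return.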

In [None]:
client = OpenAI(api_key=key)

In [None]:
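objects = []
years = [2010, 2013, 2015, 2019, 2021]
# One batch per year; each batch is currently the full message list
send_messages = [messages, messages, messages, messages, messages]
# Per-year subsets (not used by the loop below, which samples from the full list)
filtered_msgs = [filter_year(messages, year) for year in years]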

for sent_message in send_messages:
    elmnt_list = random.sample(sent_message, 10)
    for msg in elmnt_list:
        # Rough size check by word count; skip prompts too long to send
        size = len(msg[2].split())
        if size > 11000:
            objects.append(None)
            continue
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": msg[2]}],
            temperature=0,
        )
        objects.append(response)
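
Two details of the loop above are worth guarding against: `random.sample` raises a `ValueError` when a batch holds fewer than 10 messages, and oversized prompts leave `None` placeholders in `objects`. A possible alternative, sketched under assumptions, is to filter by word count first and cap the sample size. The helper name `sample_within_limit`, its parameters, and the sample data are hypothetical; note that this changes behavior slightly, since oversized prompts are excluded before sampling rather than recorded as `None`.

```python
import random

def sample_within_limit(items, k, max_words, seed=None):
    """Hypothetical helper: pick up to k prompts whose word count is within max_words."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    eligible = [item for item in items if len(item[2].split()) <= max_words]
    # Never ask for more elements than are available.
    return rng.sample(eligible, min(k, len(eligible)))

# Made-up entries in the same [date, company_name, prompt] layout as `messages`.
sample_messages = [
    ["2019-02-01", "COMPANY A", "short prompt"],
    ["2019-03-01", "COMPANY B", "a much longer prompt with many more words in it"],
    ["2019-04-01", "COMPANY C", "another short prompt"],
]
picked = sample_within_limit(sample_messages, k=10, max_words=5, seed=0)
```

With the notebook's values this would be called with `k=10` and `max_words=11000`.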

In [None]:
for response in objects:
    # Oversized prompts were stored as None; skip them
    if response is None:
        continue
    print(response.choices[0].message.content)
    print()

Yes, there are specific real-world events mentioned in this SEC filing risk section for DEERE & CO. Some of these events include:

1. The outcome of global negotiations under the World Trade Organization that could affect the international flow of agricultural commodities.
2. The potential impact of the 2007 Farm Bill in the United States on prices for farm commodities, particularly corn, cotton, and rice.
3. The policies of the Brazilian government, including those related to exchange rates and commodity prices, that could significantly change the dynamics of the agricultural economy in Brazil.
4. Changing worldwide demand for food and bio-energy that could affect prices for farm commodities and demand for agricultural equipment.
5. The continuing globalization of agricultural businesses that may change the dynamics of competition, customer base, and product offerings for DEERE & CO.
6. General economic conditions that could affect demand for the company's equipment, including negativ

In [None]:
for date, company_name, _ in elmnt_list:
    print(date)
    print(company_name)
elmnt_list[0][0]

2017-02-22
EQUIFAX INC
2021-04-22
APOGEE ENTERPRISES INC
2016-03-29
CALERES INC
2008-02-21
DANAHER CORP
2017-02-24
AFLAC INC
2016-02-12
ILLINOIS TOOL WORKS
2020-06-29
FRIEDMAN INDUSTRIES INC
2014-03-17
DAWSON GEOPHYSICAL CO
2022-02-24
ANDERSONS INC
2009-09-29
CRACKER BARREL OLD CTRY STOR


'2017-02-22'

In [None]:
if __name__ == "__main__":
    query = 'tornado'
    value = 3
    duration = "years"
    scraper = googleNewsFeedScraper(query, value, duration)
    scraper.scrape_google_news_feed()

Title: PurdueALERT test, campuswide tornado drill scheduled for March 12 - Purdue University
Link: https://news.google.com/rss/articles/CBMidmh0dHBzOi8vd3d3LnB1cmR1ZS5lZHUvbmV3c3Jvb20vcmVsZWFzZXMvMjAyNC9RMS9wdXJkdWVhbGVydC10ZXN0LWNhbXB1c3dpZGUtdG9ybmFkby1kcmlsbC1zY2hlZHVsZWQtZm9yLW1hcmNoLTEyLmh0bWzSAQA?oc=5
Description: ['PurdueALERT test, campuswide tornado drill scheduled for March 12']
Published: Thu, 07 Mar 2024 08:00:00 GMT
Source: {'href': 'https://www.purdue.edu', 'title': 'Purdue University'}
-+-
Title: April 4-5, 2023: Destructive Winds, Very Large Hail, & Tornadoes - National Weather Service
Link: https://news.google.com/rss/articles/CBMiLGh0dHBzOi8vd3d3LndlYXRoZXIuZ292L2R2bi9zdW1tYXJ5XzA0MDQyMDIz0gEA?oc=5
Description: ['April 4-5, 2023: Destructive Winds, Very Large Hail, & Tornadoes']
Published: Tue, 04 Apr 2023 07:00:00 GMT
Source: {'href': 'https://www.weather.gov', 'title': 'National Weather Service'}
-+-
Title: The December 2021 tornado outbreak, explained - National Oc