# Retrieving headlines

### Set up

You may need to install the libraries `beautifulsoup4` and `newspaper3k`.

The `GNews` library needs to be installed from the GitHub source. Here is a [StackOverflow forum] I referenced, in case it is helpful.

In [1]:
import sys
!{sys.executable} -m pip install beautifulsoup4
!{sys.executable} -m pip install newspaper3k
!{sys.executable} -m pip install git+https://github.com/ranahaani/GNews.git

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)
  Preparing metadata (setup.py) ... done
Collecting tinysegmenter==0.3 (from newspaper3k)
  Downloading tinysegmenter-0.3.tar.gz (16 kB)

In [2]:
from gnews import GNews
import datetime as dt
import pandas as pd
import numpy as np

07/10/2023 09:31:15 PM - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
07/10/2023 09:31:15 PM - NumExpr defaulting to 8 threads.


### Function

In [3]:
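# set up the 24 half-month periods (two per month)

first_day = np.ones(12, dtype = int)
middle_day = np.repeat(15, 12)
middle_day[1] = 14 # feb
last_day = np.tile([31, 30], 6) # alternating 31/30 is correct for jan-jul
last_day[7:12] = last_day[0:5] # aug-dec restart the 31/30 pattern
last_day[1] = 28 # feb (leap years are not handled)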

start_days = []
end_days = []

for i in range(12):
    
    start_days.append(first_day[i])
    end_days.append(middle_day[i])
    
    start_days.append(middle_day[i])
    end_days.append(last_day[i])

months = np.repeat(range(12), 2) + 1

# print(start_days)
# print(end_days)
# print(months)
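
The arrays above hard-code February as 28 days, so February 29 is left out in a leap year such as 2020. As a minimal sketch, the same windows could be built per year with the standard library's `calendar.monthrange`. The function name `half_month_periods` is hypothetical and is not used elsewhere in this notebook; the window boundaries (the 15th as the middle day, the 14th in February) are kept as above.

```python
import calendar
import datetime as dt

def half_month_periods(year):
    """Return 24 (start, end) date pairs covering `year` in half-month windows."""
    periods = []
    for month in range(1, 13):
        # monthrange returns (weekday of the first day, number of days in the month)
        last = calendar.monthrange(year, month)[1]
        middle = 14 if month == 2 else 15
        periods.append((dt.date(year, month, 1), dt.date(year, month, middle)))
        periods.append((dt.date(year, month, middle), dt.date(year, month, last)))
    return periods

periods_2020 = half_month_periods(2020)
print(len(periods_2020))  # 24
print(periods_2020[3])    # (datetime.date(2020, 2, 14), datetime.date(2020, 2, 29))
```

Each pair could then replace the `start_days`, `end_days` and `months` lookups, with the dates converted to `dt.datetime` where needed.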

In [4]:
def get_headlines(year, keyword):
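    """
    Collect one year of headlines for a company, in half-month windows.

    year: int
    keyword: str, the company name
    Returns a DataFrame with the columns "date", "title" and "publisher".
    """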
    
    headlines_df = pd.DataFrame(columns = ["date", "title", "publisher"])
    
    for two_week_period in range(24):
    
        month = months[two_week_period]
        start_day = start_days[two_week_period]
        end_day = end_days[two_week_period]

        start = dt.datetime(year, month, start_day)
        end = dt.datetime(year, month, end_day)

        gnews = GNews(language = "en",
                      start_date = start, 
                      end_date = end)

        news_df = pd.DataFrame(gnews.get_news(keyword))

        if news_df.empty:
            print(f"No news between {start} and {end} for {keyword}.\n")
            continue

        news_df['date'] = pd.to_datetime(news_df['published date'])

        headlines_df = pd.concat([headlines_df, news_df[['date', 'title', 'publisher']].copy()],
                                 ignore_index = True)
    
    return headlines_df

### Retrieve data

Only run one cell at a time!

When running these cells, you will see messages saying there is no news for certain time periods. That's fine; don't re-run the cell. Keep the output as it is so we have a record of when headlines were missing. Just commit and push what you have from that one run.

A day later, it can be helpful to duplicate the cell, change `range(2018, 2023+1)` to start from the first year with missing headlines, and run the code again.
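
To pick the starting year for the re-run, one option is to read it off the saved "No news" messages. The sketch below assumes the cell output has been copied into a list of strings; `first_missing_year` and `saved_output` are hypothetical names, and the pattern only matches the message format printed by `get_headlines`.

```python
import re

# Matches the message printed by get_headlines when a window has no results.
NO_NEWS = re.compile(r"No news between (\d{4})-\d{2}-\d{2} .* for (.+)\.$")

def first_missing_year(output_lines, company):
    """Return the earliest year with a "No news" message for `company`, or None."""
    years = []
    for line in output_lines:
        match = NO_NEWS.match(line.strip())
        if match and match.group(2) == company:
            years.append(int(match.group(1)))
    return min(years) if years else None

# Two messages in the format printed by get_headlines.
saved_output = [
    "No news between 2023-07-15 00:00:00 and 2023-07-31 00:00:00 for Nvidia.",
    "No news between 2023-08-01 00:00:00 and 2023-08-15 00:00:00 for Nvidia.",
]

start_year = first_missing_year(saved_output, "Nvidia")
print(start_year)  # 2023
```

The duplicated cell would then loop over `range(start_year, 2023+1)`. Note that this rewrites the CSV for every year in that range.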

In [None]:
# wafer: apple, amazon

In [None]:
# apple

company = "Apple"

for year in range(2018, 2023+1):

    headlines_df = get_headlines(year, company)

    file = "headlines/" + str(year) + "_" + company + "_headlines.csv"
    headlines_df.to_csv(file, index = False)

In [None]:
# amazon

company = "Amazon"

for year in range(2018, 2023+1):

    headlines_df = get_headlines(year, company)

    file = "headlines/" + str(year) + "_" + company + "_headlines.csv"
    headlines_df.to_csv(file, index = False)

In [None]:
# cindy: nvidia, microsoft

In [5]:
# nvidia

company = "Nvidia"

for year in range(2018, 2023+1):

    headlines_df = get_headlines(year, company)

    file = "headlines/" + str(year) + "_" + company + "_headlines.csv"
    headlines_df.to_csv(file, index = False)

No news between 2023-07-15 00:00:00 and 2023-07-31 00:00:00 for Nvidia.

No news between 2023-08-01 00:00:00 and 2023-08-15 00:00:00 for Nvidia.

No news between 2023-08-15 00:00:00 and 2023-08-31 00:00:00 for Nvidia.

No news between 2023-09-01 00:00:00 and 2023-09-15 00:00:00 for Nvidia.

No news between 2023-09-15 00:00:00 and 2023-09-30 00:00:00 for Nvidia.

No news between 2023-10-01 00:00:00 and 2023-10-15 00:00:00 for Nvidia.

No news between 2023-10-15 00:00:00 and 2023-10-31 00:00:00 for Nvidia.

No news between 2023-11-01 00:00:00 and 2023-11-15 00:00:00 for Nvidia.

No news between 2023-11-15 00:00:00 and 2023-11-30 00:00:00 for Nvidia.

No news between 2023-12-01 00:00:00 and 2023-12-15 00:00:00 for Nvidia.

No news between 2023-12-15 00:00:00 and 2023-12-31 00:00:00 for Nvidia.



In [6]:
# microsoft

company = "Microsoft"

for year in range(2018, 2023+1):

    headlines_df = get_headlines(year, company)

    file = "headlines/" + str(year) + "_" + company + "_headlines.csv"
    headlines_df.to_csv(file, index = False)

No news between 2023-07-15 00:00:00 and 2023-07-31 00:00:00 for Microsoft.

No news between 2023-08-01 00:00:00 and 2023-08-15 00:00:00 for Microsoft.

No news between 2023-08-15 00:00:00 and 2023-08-31 00:00:00 for Microsoft.

No news between 2023-09-01 00:00:00 and 2023-09-15 00:00:00 for Microsoft.

No news between 2023-09-15 00:00:00 and 2023-09-30 00:00:00 for Microsoft.

No news between 2023-10-01 00:00:00 and 2023-10-15 00:00:00 for Microsoft.

No news between 2023-10-15 00:00:00 and 2023-10-31 00:00:00 for Microsoft.

No news between 2023-11-01 00:00:00 and 2023-11-15 00:00:00 for Microsoft.

No news between 2023-11-15 00:00:00 and 2023-11-30 00:00:00 for Microsoft.

No news between 2023-12-01 00:00:00 and 2023-12-15 00:00:00 for Microsoft.

No news between 2023-12-15 00:00:00 and 2023-12-31 00:00:00 for Microsoft.

