# Scraping

We start by scraping text about a single company. I chose Google (`GOOG`) arbitrarily.

## Google News 

The Google News query can be modified to return older articles, as described [here](https://stackoverflow.com/questions/73072802/web-scraping-articles-from-google-news).


In [12]:
import sys
# !{sys.executable} -m pip install beautifulsoup4
# !{sys.executable} -m pip install newspaper3k
!{sys.executable} -m pip install git+https://github.com/ranahaani/GNews.git

Collecting git+https://github.com/ranahaani/GNews.git
  Cloning https://github.com/ranahaani/GNews.git to /private/var/folders/36/0q68lh2x6rs8rcn9slq7pd9m0000gn/T/pip-req-build-y0jpgt4f
  Running command git clone --filter=blob:none --quiet https://github.com/ranahaani/GNews.git /private/var/folders/36/0q68lh2x6rs8rcn9slq7pd9m0000gn/T/pip-req-build-y0jpgt4f
  Resolved https://github.com/ranahaani/GNews.git to commit 8591313e3fdaaf44e2e09f2265254fc3aaea8b56
  Preparing metadata (setup.py) ... done


In [13]:
from gnews import GNews
import datetime as dt
import pandas as pd
import numpy as np

In [14]:
google_news = GNews(language = 'en')
GOOG = google_news.get_news('Google')

In [None]:
GOOG = pd.DataFrame(GOOG)
GOOG['date'] = pd.to_datetime(GOOG['published date'])

In [None]:
min(GOOG['date'])

In [None]:
max(GOOG['date'])

In [16]:
# Example: restrict the search to a date window.
# Both a start date and an end date are needed.

start_date = dt.date(2022, 5, 1)
end_date = dt.date(2022, 6, 1)
tmp = GNews(language = 'en')
tmp.start_date = start_date
tmp.end_date = end_date

random = tmp.get_news('reddit')

In [17]:
random

[]

In [8]:
# `random` is an empty list here, so the DataFrame has no 'published date' column
min(pd.DataFrame(random)['published date'])

KeyError: 'published date'

In [None]:
max(pd.DataFrame(random)['published date'])
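
The `KeyError` above comes from indexing a DataFrame built from an empty result list. A minimal, dependency-free sketch of a guard is shown below. The helper name `published_range` and the sample records are hypothetical, and the date format (RFC 2822 style, e.g. `Mon, 02 May 2022 07:00:00 GMT`) is an assumption about what the `published date` field contains.

```python
from email.utils import parsedate_to_datetime


def published_range(records, key="published date"):
    """Return (earliest, latest) publication datetimes, or None if there are no records."""
    # Assumption: each record is a dict whose date string is in RFC 2822 format.
    dates = [parsedate_to_datetime(r[key]) for r in records if key in r]
    if not dates:
        return None
    return min(dates), max(dates)


# Hypothetical records shaped like the dictionaries in the result list.
sample = [
    {"title": "A", "published date": "Mon, 02 May 2022 07:00:00 GMT"},
    {"title": "B", "published date": "Fri, 20 May 2022 12:30:00 GMT"},
]

print(published_range([]))  # None: an empty result no longer raises KeyError
print(published_range(sample))
```

Returning `None` for an empty result lets the calling cell decide whether to skip the window or report it.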

## Goal: retrieve all news about "Google" from 01/01/2018 - 12/31/2018

In [2]:
first_day = np.ones(12, dtype = int)
middle_day = np.repeat(15, 12)
middle_day[1] = 14 # feb
last_day = np.tile([31, 30], 6)
last_day[7:12] = last_day[0:5]
last_day[1] = 28 # feb

start_days = []
end_days = []

for i in range(12):
    
    start_days.append(first_day[i])
    end_days.append(middle_day[i])
    
    start_days.append(middle_day[i])
    end_days.append(last_day[i])


In [3]:
print(start_days)
print(end_days)

[1, 15, 1, 14, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15, 1, 15]
[15, 31, 14, 28, 15, 31, 15, 30, 15, 31, 15, 30, 15, 31, 15, 31, 15, 30, 15, 31, 15, 30, 15, 31]


In [4]:
months = np.repeat(range(12), 2) + 1
months

array([ 1,  1,  2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7,  8,  8,  9,
        9, 10, 10, 11, 11, 12, 12])
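
The day and month arrays above can also be derived from the calendar instead of being patched by hand. The sketch below is one possible alternative, not the code used in this notebook: `half_month_windows` is a hypothetical helper, and it uses the 15th as the midpoint for every month (the arrays above use the 14th for February).

```python
import calendar
import datetime as dt


def half_month_windows(year, mid_day=15):
    """Return 24 (start, end) date pairs: 1st to mid_day and mid_day to month end."""
    windows = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        first = dt.date(year, month, 1)
        middle = dt.date(year, month, mid_day)
        last = dt.date(year, month, last_day)
        windows.append((first, middle))
        windows.append((middle, last))
    return windows


windows = half_month_windows(2018)
print(len(windows))
print(windows[2], windows[3])  # the two February windows
```

Because the month lengths come from `calendar.monthrange`, the same function handles leap years without a special case for February.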

In [None]:
# scrape top 100 search results for keyword "Google" 
# for all 2-week periods in 2018

In [5]:
headlines_df = pd.DataFrame(columns = ["date", "title"])

In [6]:
for two_week_period in range(24):
    
    month = months[two_week_period]
    start_day = start_days[two_week_period]
    end_day = end_days[two_week_period]
    
    start = dt.datetime(2018, month, start_day)
    end = dt.datetime(2018, month, end_day)
    
    gnews = GNews(language = "en",
                  start_date = start, 
                  end_date = end)
    
    news_df = pd.DataFrame(gnews.get_news('Google'))
    
    if news_df.shape == (0, 0):
        print(f"No news between {start} and {end}.\n")
        continue
        
    news_df['date'] = pd.to_datetime(news_df['published date'])
    
    headlines_df = pd.concat([headlines_df, news_df[['date', 'title']].copy()],
                             ignore_index = True)

No news between 2018-01-01 00:00:00 and 2018-01-15 00:00:00.



KeyboardInterrupt: 
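
The run above was interrupted partway through, and the loop grows `headlines_df` with `pd.concat` on every iteration. A hedged sketch of a more forgiving collection loop follows. The function `collect_headlines`, the injected `fetch` callable, and `fake_fetch` are all hypothetical stand-ins so the logic can run without network access; in the notebook, `fetch` would wrap the `GNews(...).get_news('Google')` call.

```python
def collect_headlines(windows, fetch):
    """Call fetch(start, end) for each window; return (rows, skipped)."""
    rows, skipped = [], []
    for start, end in windows:
        try:
            records = fetch(start, end)
        except KeyboardInterrupt:
            # Stop early but keep whatever was gathered so far.
            break
        except Exception as exc:
            skipped.append((start, end, f"error: {exc}"))
            continue
        if not records:
            skipped.append((start, end, "no results"))
            continue
        rows.extend(
            {"date": r["published date"], "title": r["title"]} for r in records
        )
    return rows, skipped


def fake_fetch(start, end):
    """Hypothetical stand-in for the scraper: empty, failing, then one result."""
    if start == "2018-01-01":
        return []
    if start == "2018-01-15":
        raise TimeoutError("simulated network failure")
    return [{"published date": "Fri, 02 Feb 2018 08:00:00 GMT",
             "title": "Google example headline"}]


demo_windows = [("2018-01-01", "2018-01-15"),
                ("2018-01-15", "2018-01-31"),
                ("2018-02-01", "2018-02-14")]
rows, skipped = collect_headlines(demo_windows, fake_fetch)
print(len(rows), len(skipped))
```

The collected `rows` can be passed to `pd.DataFrame(rows)` once at the end, and `skipped` records which windows need a retry.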

In [None]:
min(headlines_df['date'])

In [None]:
max(headlines_df['date'])

In [None]:
# check how many articles are dated after 2018-02-14, i.e. outside the date range

(headlines_df['date'] > pd.Timestamp('2018-02-14', tz = 'UTC')).sum()

In [None]:
# number of unique headlines

len(headlines_df['title'].unique())

In [None]:
# number of unique headlines that do not mention "google" (case-insensitive)

unique_titles = pd.Series(headlines_df['title'].unique())
len(unique_titles) - unique_titles.str.contains('google', case = False).sum()

In [None]:
headlines_df.loc[~ headlines_df['title'].str.contains('google', case = False)]
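
The last few cells combine two steps: deduplicating headlines and checking whether each one mentions the keyword. A plain-Python sketch of the same logic is given below for illustration. The helper `off_topic_titles` and the sample titles are hypothetical, not data from the scrape.

```python
def off_topic_titles(titles, keyword):
    """Return unique titles (first occurrence order) that do not contain keyword."""
    needle = keyword.casefold()
    seen = set()
    result = []
    for title in titles:
        if title in seen:
            continue  # skip exact duplicates
        seen.add(title)
        if needle not in title.casefold():
            result.append(title)
    return result


# Hypothetical headlines, including one exact duplicate.
titles = [
    "Google unveils new phone",
    "Alphabet shares rise after earnings",
    "GOOGLE faces antitrust fine",
    "Alphabet shares rise after earnings",
]
print(off_topic_titles(titles, "google"))
```

Headlines that pass this filter are worth inspecting by hand, since they may still be relevant (for example, ones that mention Alphabet rather than Google).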


## Things that didn't work

#### Twitter API ([tutorial](https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a))

Twitter's API has a cap of 1,500 tweets per month, which is too few for this project, so that's a no-go.

#### Reddit API

PRAW (the Python Reddit API Wrapper) seems to have higher limits.
