**Gabriela Tanumihardja**<br>
**Capstone Project - Part I** <br>
**Data acquisition - webscraping**

## Table of contents
1. [Introduction](#intro)
2. [Beaverton](#bvt)
3. [The Globe and Mail](#gm)
4. [The Onion](#onion)
5. [The New York Times](#nyt)
6. [The New York Times - rescrape](#nyt2)

***

In [6]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import requests
import shutil
import urllib
import re
import json

***

### Introduction
<a id='intro'></a> 

Satire is a literary device that draws the reader's attention to the shortcomings and vices of people, organizations, or society. Though often humorous, its purpose is to make readers think and perhaps to prompt a change in the system itself. Many satirical novels have been published throughout history, for example 1984 by George Orwell and A Clockwork Orange by Anthony Burgess. Satire is also common in news production, especially on the web. Although satirical news is never meant to delude its readers, it is often very difficult to distinguish from legitimate news. Websites like the Onion and the Beaverton look very similar to legitimate news sources. If I didn't know that the Onion exclusively publishes satirical pieces, I would have a hard time telling its articles apart based on their headlines alone. For this reason, I would like to build a model that can classify a news headline as satirical or legitimate. It would also be very interesting to see whether a model could predict the source of a headline. <br>

To start this project, I need to obtain news headlines, both satirical and legitimate, from news websites. I have chosen the Beaverton, the Onion, the Globe and Mail, and the New York Times. I opted for these sources because they are well established and their articles are of high quality. I limited my sources to Canadian and American outlets to reduce biases related to customs, traditions, or local events. I would like the data to be relatively balanced across sources, and especially between the satirical and legitimate classes. I will now scrape these sites, starting with **the Beaverton**. I will use the BeautifulSoup package to parse website content and Selenium to automate my browser.

***

Setting up browser in Chrome for Selenium:

In [2]:
# Save path to chromedriver executable file to variable

chrome_path = '/Users/gabrielatanumihardja/opt/chromedriver'

In [17]:
# Save the chrome webdriver to variable
# Open browser

browser = webdriver.Chrome(chrome_path)

***

## Beaverton 🐿️
<a id='bvt'></a>

Looking at the website, I will scrape the articles from the `National`, `World`, `Sports`, `Business`, and `Culture` tabs. Each page has a next button with the class name `next`, which I will use to tell Selenium where to click. Each tab has a different number of pages, so I will wrap the click in a `try` block: when a `NoSuchElementException` is thrown, I will break out of the current loop and move on to the next tab.

#### Loops

In [49]:
# Specify the tabs

topic_list_bvt = ['national', 'world', 'sports', 'business', 'culture']

In [50]:
# Create empty lists for title, topics, date, and source (will be beaverton)

titles_list_bvt = []
topics_bvt = []
date_bvt = []
source_bvt = []

# Function that will go to the first website, scrape all the data, append all the articles, date, 'beaverton' as source, and topics to empty lists.

def scrape_bvt(topic):
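    # Get the page source from the browser
    websource_bvt = browser.page_source

    # Parse it; naming the parser explicitly avoids bs4 guessing one
    soup = BeautifulSoup(websource_bvt, 'html.parser')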

    articles = soup.find_all('h3', itemprop = 'headline')
        
    dates = soup.find_all('time', datetime = True)
        
    for article in articles:
        titles_list_bvt.append(article.text)
        topics_bvt.append(topic)
        source_bvt.append('beaverton')
            
    for date in dates:
        date_bvt.append(date['content'])

# Loop through pages and tabs with try and except built in

for topic in topic_list_bvt:
    browser.get(f'https://www.thebeaverton.com/news/{topic}/')
    
    while True:
        try:
            scrape_bvt(topic)
            browser.find_element(By.CLASS_NAME, 'next').click()
            time.sleep(5)
        except NoSuchElementException:
            break
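
The page-turning loop above follows a general pattern: scrape, try to advance, and stop when the next button is missing. Below is a minimal sketch of that pattern with a page cap added as a safety net. The names `scrape_all_pages`, `scrape_page`, `click_next`, `stop_on`, and `max_pages` are hypothetical and not part of this notebook; with Selenium, `stop_on` would be `NoSuchElementException`.

```python
import time


def scrape_all_pages(scrape_page, click_next, stop_on, max_pages=500, pause=0.0):
    """Scrape the current page, then advance until `click_next` raises `stop_on`.

    All names are illustrative. `max_pages` is an assumed safety cap so the
    loop cannot run forever if the next button never disappears.
    """
    pages_scraped = 0
    while pages_scraped < max_pages:
        scrape_page()
        pages_scraped += 1
        try:
            click_next()
        except stop_on:
            # No next button: this tab is finished
            break
        time.sleep(pause)
    return pages_scraped
```

Passing the scrape and click steps in as functions keeps the loop itself independent of any one website, so the same sketch would cover both the Beaverton and the Onion loops.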

Everything seems to work! I will put the scraped data into a df, which I can later export as a csv.

In [51]:
# Make df

articles_df = pd.DataFrame(
    {'title': titles_list_bvt,
     'topic': topics_bvt,
     'date_published': date_bvt,
     'source': source_bvt
    })

In [52]:
# Check!

articles_df

Unnamed: 0,title,topic,date_published,source
0,Guy who has definitely gotten into a fight at ...,national,2020-08-24T12:20:03-04:00,beaverton
1,Party that wants to manage government can’t ma...,national,2020-08-24T09:49:25-04:00,beaverton
2,Canada searches for new country to compare our...,national,2020-08-21T15:14:56-04:00,beaverton
3,Trudeau hopes giving Parliament five week vaca...,national,2020-08-18T19:08:42-04:00,beaverton
4,Highlights of Andrew Scheer’s tenure as Conser...,national,2020-08-17T11:34:45-04:00,beaverton
...,...,...,...,...
2717,Beatles fan unimpressed by rest of humanity,culture,2011-10-02T19:03:09-04:00,beaverton
2718,CBC to change fall programming,culture,2011-07-08T20:22:07-04:00,beaverton
2719,Local Yeti captures rare photograph of graffit...,culture,2011-05-16T06:05:16-04:00,beaverton
2720,Eminem makes another mean face for photo shoot,culture,2011-01-21T09:20:49-05:00,beaverton


In [53]:
# Export df to a csv

articles_df.to_csv('data/beaverton.csv')

In [54]:
# Check!

bvt = pd.read_csv('data/beaverton.csv')

In [56]:
# Double check!

len(bvt)

2722

***

## Globe and Mail 🌎
<a id='gm'></a>

From the Globe and Mail I will scrape headlines from the `Canada`, `World`, `Sports`, `Arts`, and `Politics` sections of the website.

#### Loops

In [18]:
# Specify topic list and empty lists

topic_list_gm = ['canada', 'world', 'sports', 'arts', 'politics']
titles_list_gm = []
topics_gm = []
date_gm = []
source_gm = []

The Globe and Mail website is built differently from the Beaverton's. With each click of the `View More` button, the page lengthens and new stories are appended to the end of the list. For the sake of simplicity, I will scrape only the articles below the `Latest` tag. The dates are tricky to obtain because the tag classes vary between articles, so I will pass both classes into my search. Because the page grows with each click, I will click the button a number of times and then scrape the whole page once. Each click generally loads 20 new articles, so with 80 clicks per topic and 5 topics I expect roughly 8,000 articles (80 × 20 × 5), plus those already on the initial pages. I think this will be sufficient for my project.
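
As a quick sanity check on those numbers, the arithmetic can be written out. This is only a sketch: the 20-articles-per-click figure is approximate, the variable names are illustrative, and the 10-second pause is the `time.sleep(10)` used in the scraping loop.

```python
clicks_per_topic = 80      # number of "View More" clicks per section
articles_per_click = 20    # approximate, varies by page
n_topics = 5
pause_seconds = 10         # wait between clicks

expected_articles = clicks_per_topic * articles_per_click * n_topics
waiting_minutes = clicks_per_topic * pause_seconds * n_topics / 60

print(expected_articles)        # 8000
print(round(waiting_minutes))   # 67
```

So the scrape should yield on the order of 8,000 headlines and spend a little over an hour just waiting between clicks.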

In [19]:
# Starting loop

for topic in topic_list_gm:
    
    browser.get(f'https://www.theglobeandmail.com/{topic}/')
    
    button = browser.find_element(By.XPATH, f'//button[normalize-space()= "View More {topic.title()}"]')
    
    # 80 clicks should be sufficient
    for i in range(80):
        browser.execute_script("arguments[0].click();", button)
        
        # Print to make sure that it's still clickin' away
        if i%5 == 0:
            print(f'Click {i}')
        
        time.sleep(10)

    websource_gm = browser.page_source

    soup = BeautifulSoup(websource_gm, 'html.parser')
    
    # Scrape articles from **below** the latest tag
    latest = soup.find('div', class_ = 'u-wrapper pb-feature pb-layout-item pb-f-global-story-feed')

    articles_gm = latest.findChildren('div', class_ = 'c-card__hed-text')

    dates_gm = latest.findChildren('time', class_ = ['c-timestamp u-no-wrap js-story-moment',
                                                     'c-timestamp u-no-wrap'])
    
    # Append the articles, dates, source, and topics to empty lists
    for article in articles_gm:
        titles_list_gm.append(article.text)
        topics_gm.append(topic)
        source_gm.append('the globe and mail')

    for date in dates_gm:
        date_gm.append(date['datetime'])

Click 0
Click 5
Click 10
Click 15
Click 20
Click 25
Click 30
Click 35
Click 40
Click 45
Click 50
Click 55
Click 60
Click 65
Click 70
Click 75
Click 0
Click 5
Click 10
Click 15
Click 20
Click 25
Click 30
Click 35
Click 40
Click 45
Click 50
Click 55
Click 60
Click 65
Click 70
Click 75
Click 0
Click 5
Click 10
Click 15
Click 20
Click 25
Click 30
Click 35
Click 40
Click 45
Click 50
Click 55
Click 60
Click 65
Click 70
Click 75
Click 0
Click 5
Click 10
Click 15
Click 20
Click 25
Click 30
Click 35
Click 40
Click 45
Click 50
Click 55
Click 60
Click 65
Click 70
Click 75
Click 0
Click 5
Click 10
Click 15
Click 20
Click 25
Click 30
Click 35
Click 40
Click 45
Click 50
Click 55
Click 60
Click 65
Click 70
Click 75


Everything seems to work as expected. I will now convert the data into a df and save it as a csv.

In [22]:
# Make df

articles_df_gm = pd.DataFrame(
    {'title': titles_list_gm,
     'topic': topics_gm,
     'date_published': date_gm,
     'source': source_gm
    })

In [23]:
# Check!

articles_df_gm

Unnamed: 0,title,topic,date_published,source
0,Artificial intelligence needed to fight child...,canada,2020-08-24T22:28:28.391Z,the globe and mail
1,Evening Update: First documented coronavirus ...,canada,2020-08-24T20:54:37.240Z,the globe and mail
2,New study calls for fresh approach to tacklin...,canada,2020-08-24T20:47:44.753Z,the globe and mail
3,‘Nerve-racking’: Staff talk about stress of f...,canada,2020-08-24T20:41:33.986Z,the globe and mail
4,Alberta Health Minister still talking with do...,canada,2020-08-24T19:52:20.261Z,the globe and mail
...,...,...,...,...
8175,"Trudeau invites Scheer, Blanchet, Singh and M...",politics,2019-11-04T00:12:04.506Z,the globe and mail
8176,Don’t blame Ottawa for Encana’s loss,politics,2019-11-03T23:55:14.036Z,the globe and mail
8177,Will backbench MPs seize the power of a minor...,politics,2019-11-03T23:51:38.475Z,the globe and mail
8178,Refugee advocates set to challenge Canada’s b...,politics,2019-11-03T23:43:31.132Z,the globe and mail


In [25]:
# Save to csv

articles_df_gm.to_csv('data/globeandmail.csv')

***

## The Onion 🧅
<a id='onion'></a>

From the Onion, I will scrape headlines from `Politics`, `Sports`, `Local`, and `Entertainment` sections.

#### Loops

In [4]:
# Specify a list of topics and empty lists
topic_list_o = ['politics', 'sports', 'local', 'entertainment']

titles_list_o = []
topics_o = []
date_o = []
source_o = []

I will extract the headlines from the panel section of each page and then click the `More Stories` button. As with the Beaverton code, I will build in a `try` and `except`. The article timestamps are peculiar: there are 2 elements under the same class, so each iteration extracts every date twice. I could not find a difference between the two elements, so I will create a new list that simply drops the duplicated dates, since it is very unlikely that the Onion would publish 2 articles at exactly the same time under the same topic.

In [5]:
# Function that will go to the first website, scrape all the data, append all the articles, date, 'onion' as source, and topics to empty lists.

def scrape_onion(topic):

    websource_o = browser.page_source

    soup = BeautifulSoup(websource_o, 'html.parser')

    articles = soup.find_all('h2', class_ = ['sc-759qgu-0 cYlVdn cw4lnv-6 eXwNRE',
                                             'sc-759qgu-0 sc-759qgu-1 jmAmn sc-3kpz0l-8 bbgWOc'])

    dates = soup.find_all('time', class_ = 'uhd9ir-0 gWMcOL cjg713-0 jElrJy')

    # Drop duplicated timestamps
    clean_dates = list(pd.Series(dates).drop_duplicates())

    for article in articles:
        titles_list_o.append(article.text)
        topics_o.append(topic)
        source_o.append('the onion')

    for date in clean_dates:
        date_o.append(date['datetime'])

    
for topic in topic_list_o:
    browser.get(f'https://{topic}.theonion.com/')
    
    while True:
        try:
            scrape_onion(topic)
            browser.find_element(By.XPATH, '/html/body/div[3]/div[4]/main/div/div[5]/div/a/button').click()
            time.sleep(5)
        except NoSuchElementException:
            break
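
The duplicate handling can also be done on the timestamp strings themselves rather than on the tag objects. The sketch below is an alternative, not the code that was run: `drop_duplicate_dates` is a hypothetical helper and the sample list is made up for illustration. It keeps the first occurrence of each value and preserves the original order.

```python
def drop_duplicate_dates(datetimes):
    """Return the timestamps with repeats removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(datetimes))


# Illustrative input: every timestamp appears twice, as on the Onion pages
sample = [
    '2020-08-21T13:35:00-05:00', '2020-08-21T13:35:00-05:00',
    '2020-08-21T13:00:00-05:00', '2020-08-21T13:00:00-05:00',
]
print(drop_duplicate_dates(sample))
```

Like `drop_duplicates`, this would also merge two different articles that happen to share an identical timestamp, so the same caveat applies.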

It seems to have worked just fine. I will create a df from the lists, and hope that the number of dates matches the length of the other lists 🤞. 
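
Rather than only hoping, the list lengths can be checked before building the df. A minimal sketch, where `check_lengths` is a hypothetical helper that is not part of this notebook and the toy lists stand in for the scraped ones:

```python
def check_lengths(**columns):
    """Raise ValueError if the named lists differ in length; else return the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Toy lists standing in for titles_list_o, topics_o, and date_o
print(check_lengths(title=['a', 'b'], topic=['x', 'y'], date_published=['d1', 'd2']))
```

With the real data the call would look like `check_lengths(title=titles_list_o, topic=topics_o, date_published=date_o, source=source_o)`. pandas would raise its own error on mismatched lengths, but this version reports which list is off.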

In [6]:
# Compile the df

articles_df_o = pd.DataFrame(
    {'title': titles_list_o,
     'topic': topics_o,
     'date_published': date_o,
     'source': source_o
    })

It worked 🙌! I assume the dates for each of these articles are correct, since they line up over topics as well.

In [7]:
# Check!

articles_df_o

Unnamed: 0,title,topic,date_published,source
0,Highlights Of The 2020 Democratic National Con...,politics,2020-08-21T13:35:00-05:00,the onion
1,Congressional Republicans Grill Postmaster Gen...,politics,2020-08-21T13:00:00-05:00,the onion
2,"Bloomberg Looks Straight Into Camera, Silently...",politics,2020-08-20T21:20:00-05:00,the onion
3,"‘Milwaukee Is A Great City On A Great Lake,’ S...",politics,2020-08-20T11:20:00-05:00,the onion
4,DNC Speakers Can’t Believe They’re Giving Prim...,politics,2020-08-20T11:00:00-05:00,the onion
...,...,...,...,...
20195,"Hugh Hefner Comes Out of Retirement, Changes P...",entertainment,1996-02-20T18:18:00-06:00,the onion
20196,Nine Drawn and Quartered at Renaissance Fair,entertainment,1995-12-18T18:38:00-06:00,the onion
20197,Congress Hires Drummer,entertainment,1995-12-11T18:55:00-06:00,the onion
20198,Sonic Booms,entertainment,1993-12-06T18:09:00-06:00,the onion


In [9]:
# Save to csv

articles_df_o.to_csv('data/theonion.csv')

***

## The New York Times 🗽
<a id='nyt'></a>

The New York Times has an API from which I can retrieve all of the articles published in a specified month and year. For the sake of simplicity, I will retrieve the latest 4 months of NYT articles. Each request returns a very large amount of data, so 4 API requests return more data than I have from any of the other sources. This may create an imbalance in my dataset in terms of time; however, it should be good enough to get going for now, and I will return and rescrape once I re-evaluate. The API request returns a json, from which I can extract the fields I want by calling the keys of each dictionary entry. 

In [11]:
# Specify months to scrape
months = [5, 6, 7, 8]
topic = []
headlines = []
date = []

# Looping over months and getting info from each article through dictionary's keys
for mon in months:
    
    req = requests.get(f'https://api.nytimes.com/svc/archive/v1/2020/{mon}.json?api-key=q8mol0Visv97gwH0dPI7BDMW5SYTegT0')
    
    full_month = req.json()
    
    # Get to the meat of the json
    
    docs = full_month['response']['docs']
    
    # One pass per article keeps the three lists aligned
    for article in docs:
        headlines.append(article['headline']['main'])
        topic.append(article['news_desk'])
        date.append(article['pub_date'])
        
    print(f'{mon}/2020 completed')

5/2020 completed
6/2020 completed
7/2020 completed
8/2020 completed
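
Since the API key is written directly into the request URL above, one option is to read it from an environment variable instead. This is a sketch of that idea, not what the notebook ran: the variable name `NYT_API_KEY` and the helper `archive_url` are assumptions, while the URL pattern is the one used above.

```python
import os


def archive_url(year, month, api_key):
    """Build the NYT Archive API URL for one month, using the pattern from this notebook."""
    return f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}'


# NYT_API_KEY is a hypothetical environment variable name
api_key = os.environ.get('NYT_API_KEY', 'YOUR-KEY-HERE')
print(archive_url(2020, 5, 'YOUR-KEY-HERE'))
```

The request line would then become `requests.get(archive_url(2020, mon, api_key))`, which keeps the key out of the saved notebook.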


Everything seemed to work fine. I will now create a df from the lists.

In [12]:
# Make df

articles_df_nyt = pd.DataFrame(
    {'title': headlines,
     'topic': topic,
     'date_published': date,
     'source': 'nyt'
    })

In [13]:
# Check!!

articles_df_nyt

Unnamed: 0,title,topic,date_published,source
0,Seven States to Coordinate on Amassing Medical...,Metro,2020-05-03T13:10:38+0000,nyt
1,The Courage to Be Alone,OpEd,2020-05-01T19:30:09+0000,nyt
2,Tennis Coming Back Slowly With Exhibition Matches,Sports,2020-05-01T05:00:06+0000,nyt
3,The Best Part of the Campaign Trail (the Food!...,Dining,2020-05-01T18:24:58+0000,nyt
4,"Go Ahead, Blow Out the Candles on Zoom",AtHome,2020-05-02T16:00:07+0000,nyt
...,...,...,...,...
24818,"To Test Spread of Coronavirus, These Scientist...",Culture,2020-08-23T18:45:59+0000,nyt
24819,Map: Tracking Tropical Storms Laura and Marco,U.S.,2020-08-23T03:29:21+0000,nyt
24820,How is the Coronavirus Affecting Low-Income Fa...,Reader Center,2020-08-23T05:25:15+0000,nyt
24821,A Film Pantheon That Omits Black Directors,Movies,2020-08-23T14:55:57+0000,nyt


In [15]:
articles_df_nyt.to_csv('data/nyt.csv')

__In building this notebook, I tested my code a little at a time and scaled it up once I was sure it would work. It took a lot of trial and error to find the right classes and tags and to write code that runs smoothly over the whole site.__<br>

Now I will go on to clean the data... pt. 2

***

## The New York Times pt 2
<a id='nyt2'></a>

From evaluating the initial models, I found that the publication dates of the NYT articles were causing significant bias and skewing my model. From earlier EDA, I know that the published dates in the overall dataset range from 1993 to 2020, so I will scrape all the NYT headlines I can from 1993 to 2020. Once I am done with this, I will randomly downsample the very large amount of data to match the size of the first iteration (20,000 articles). In preliminary trials, the API threw a `KeyError` at random. Since there is a lot of data, I will simply catch this error, `continue`, and move on. Only 8 months of 2020 have passed, so I will scrape the 2020 data separately. The rest of the code is very similar to the previous code, with the addition of `try`, `except`, and `finally`!

In [1]:
import random

In [None]:
# Set up the months and empty lists

months = np.arange(1,13)
months_2020 = np.arange(1,9)
years = np.arange(1993, 2021)
topic = []
headlines = []
date = []

# Loops - through years
for year in years:
    if year != 2020:
        # - through months
        for mon in months:

            req = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{mon}.json?api-key=q8mol0Visv97gwH0dPI7BDMW5SYTegT0')

            full_month = req.json()

            try:
                docs = full_month['response']['docs']

                for article in docs:
                    topic.append(article['section_name'])
                    headlines.append(article['headline']['main'])
                    date.append(article['pub_date'])
                    
                time.sleep(5)
                                
            # Where am I getting a key error?
            except KeyError:
                print(f'key error {mon}/{year}')
                continue
                
            # Show me where I'm at!    
            finally:
                print(f'{mon}/{year} completed')
                
    else:
        # year == 2020
        for mon in months_2020:

            req = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{mon}.json?api-key=q8mol0Visv97gwH0dPI7BDMW5SYTegT0')

            full_month = req.json()

            try:
                docs = full_month['response']['docs']

                for article in docs:
                    topic.append(article['section_name'])
                    headlines.append(article['headline']['main'])
                    date.append(article['pub_date'])
                    
                time.sleep(5)
                                
            except KeyError:
                print(f'key error {mon}/{year}')
                continue
                
            finally:
                print(f'{mon}/{year} completed')
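
An alternative to catching the `KeyError` is to read the response defensively with `dict.get`, so that a month with a missing key simply yields no rows or a `None` field. A hedged sketch: `extract_rows` is a hypothetical helper, and both sample payloads are invented to mirror the keys used in this notebook (`response`, `docs`, `section_name`, `headline`, `main`, `pub_date`).

```python
def extract_rows(payload):
    """Return (headline, section, pub_date) tuples from one Archive API response."""
    docs = (payload.get('response') or {}).get('docs') or []
    rows = []
    for article in docs:
        headline = (article.get('headline') or {}).get('main')
        rows.append((headline, article.get('section_name'), article.get('pub_date')))
    return rows


# Invented payloads for illustration only
good = {'response': {'docs': [
    {'headline': {'main': 'Example headline'}, 'section_name': 'World',
     'pub_date': '2020-05-01T05:00:06+0000'},
    {'headline': {'main': 'No section here'}, 'pub_date': '2020-05-02T16:00:07+0000'},
]}}
bad = {'fault': 'placeholder for a response without the expected keys'}

print(len(extract_rows(good)))  # 2
print(extract_rows(bad))        # []
```

Missing fields come back as `None` instead of raising, which keeps the headline, topic, and date columns the same length.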

Alright, everything seems to be in order! Time to put together the dataframe and then sample!

In [149]:
# Check!

len(date)

55719

Whew, 55K articles! I definitely need to downsample, right after creating the df.

In [150]:
# Create df

articles_df_nyt = pd.DataFrame(
    {'title': headlines,
     'topic': topic,
     'date_published': date,
     'source': 'nyt'
    })

In [15]:
# Save the whole data to csv

articles_df_nyt.to_csv('data/nyt1.csv')

Now I will simply use the sample function to pare down the massive amount of data to 20,000.

In [None]:
# Sampling

sampled_nyt = articles_df_nyt.sample(20000)

In [None]:
# Save sampled data to csv

sampled_nyt.to_csv('data/sampled_nyt.csv')
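
Because `sample` draws different rows on each run, fixing a seed would make the downsample reproducible; pandas' `DataFrame.sample` accepts a `random_state` argument for this. The idea is sketched below with the standard-library `random` module on a toy list, so that it runs without the scraped data. The seed `42` and the helper `downsample` are arbitrary, illustrative choices.

```python
import random


def downsample(rows, k, seed=42):
    """Pick k rows without replacement; the same seed gives the same sample."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


toy_rows = list(range(100))   # stand-in for the 55,719 scraped rows
first = downsample(toy_rows, 10)
second = downsample(toy_rows, 10)
print(first == second)        # True
```

In the notebook itself, the equivalent would be something like `articles_df_nyt.sample(20000, random_state=42)`.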

Moving on to cleaning and adjusting the data... Then to retrain the models... (continue to pt. 4)