# Political News Stories Collection
## A script to gather political stories from CNN.com and foxnews.com
## By: Bryan Kolano, December 7th, 2022

***

#### Background
I was trying to think of a new project that involved text and classification.  After thinking for a while, I got the idea that analyzing political stories might be an interesting topic to tackle.  CNN and Fox News use very different language, focus on different topics, and typically cover the same topics very differently.  <br>

The point of this script is to grab new headlines each day from each news source and then combine them in a CSV.  After a couple hundred (maybe a couple thousand) stories are collected, I plan to do the analysis in a different script.  I want to examine differences between the two sources, and I also plan to test various machine learning algorithms to see if they can correctly classify whether a text comes from CNN or Fox News.

#### Import packages

In [5]:
#packages for scraping and driver
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

#baseline python packages needed
from datetime import datetime
import re

#Pandas for data collection/ indexing
import pandas as pd

#### Collection of CNN Articles

From 2019-2021, while I was part of the Intelligence and Security Command (INSCOM) Data Science Team, we taught a class called "Data Analytics in R" (OS305) to open source intelligence analysts.  These analysts were soldiers, Army civilians, and government contractors who wanted to be able to analyze data collected from the open internet. <br>

These students had minimal experience in R; they had only taken a small coding bootcamp before taking OS305.  As part of OS305, I gave a block of instruction on webscraping.  For the class, we scraped a couple of pre-determined websites to show how to read a website's HTML and CSS and then scrape the data. <br>

During one iteration of the class, I decided to call an audible and grab a random website to scrape.  I broke a cardinal rule of teaching coding: don't live code in front of students, haha.  I chose CNN.com and tried to webscrape it using R.  No matter how many ways I tried to manipulate the HTML and CSS structure, I could not pull the information I wanted.  I told the class that I would look into it and get back to them.  <br>

As it turns out, CNN, among many other websites, uses JavaScript to render content once the page is loaded in the browser.  In other words, when webscraping, you're trying to grab information that doesn't exist yet, because the GET request returns the HTML before the browser has run the JavaScript that fills in the content.  At the time, I did not know CNN did that; alas, plain webscraping would not work for CNN, and I shared the reason with the class.
<br>
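
To make the point concrete, here is a minimal sketch of what "the content doesn't exist yet" looks like.  Both HTML strings are made up for illustration (they are not CNN's real markup), and the sketch uses the standard library's `html.parser` instead of Beautiful Soup only so that it runs without extra installs.  The helper names `HeadlineCounter` and `count_headlines` are hypothetical.

```python
from html.parser import HTMLParser


class HeadlineCounter(HTMLParser):
    """Count elements whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_headlines(html, class_name="container__headline"):
    """Return how many elements in html carry class_name."""
    parser = HeadlineCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up HTML that a plain GET request might return: an empty shell plus a script
static_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

# Made-up HTML after a browser has run the script and filled in the page
rendered_html = (
    '<html><body><div id="app">'
    '<span class="container__headline">Story one</span>'
    '<span class="container__headline">Story two</span>'
    '</div></body></html>'
)

print(count_headlines(static_html))    # 0
print(count_headlines(rendered_html))  # 2
```

The same selector finds nothing in the first string and two headlines in the second, which is the situation I ran into in class.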

Due to the way CNN renders content, it is necessary to use webdriving to navigate CNN and gather data.  Therefore, the following section uses Selenium to load each page and then Beautiful Soup to parse the HTML document.  <br>

After grabbing the information from each news article, I turn it into a Pandas DataFrame and then append the results to a CSV called "news_articles.csv".
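
The cells below append to the CSV with `header = False`, so the file never gets a header row.  If a header is wanted, one option is to write it only when the file is new or empty.  The sketch below shows that idea with the standard library's `csv` module; `append_rows` is a hypothetical helper, not part of this script.  With Pandas, the equivalent would be passing something like `header = not os.path.exists(path)` to `to_csv`.

```python
import csv
import os


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Example call (commented out so that running this cell does not create a file):
# append_rows("news_articles.csv",
#             [{"title": "Example title", "source": "CNN"}],
#             ["title", "source"])
```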


In [7]:

#define CNN urls 
cnn_base_url = 'https://www.cnn.com'
cnn_politics_url = 'https://www.cnn.com/politics'

#create Selenium driver element
options = Options()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
#raw string so the backslashes in the Windows path are not treated as escape characters
service = Service(executable_path=r'D:\Projects\gun_violence\chromedriver.exe')
driver = webdriver.Chrome(service=service, options= options)

try:
    #go to the politics page of CNN
    driver.get(cnn_politics_url)

    #collect HTML of driver and turn into BS element
    soup = BeautifulSoup(driver.page_source,'html.parser' )

    #grab the html of titles of all articles on the page
    politic_titles = soup.select('.container__headline')

    #create list of all titles
    cnn_titles = [title.text.strip() for title in politic_titles]

    #create list of the URL extensions for all titles
    url_extensions = [url['href'] for item in soup.select('div.container_lead-plus-headlines__field-links') for url in item.select('a') ]

    #combine base URL with each URL extension
    cnn_urls = [cnn_base_url + url for url in url_extensions]

    if len(cnn_urls) == 0:
        raise ValueError('Need to recheck the CNN code, not pulling URLs')

    #set up blank lists to append to
    cnn_text_of_articles = []
    cnn_date = []

except WebDriverException:
    #close the browser and stop here; the loop below cannot run without cnn_urls
    print('Error getting main page')
    driver.quit()
    raise

#loops across all URLs on the politics page
for url in cnn_urls:
    
    try:
        #tell driver to grab each url
        driver.get(url)

        #turn each page into BS element and grab HTML
        page_soup = BeautifulSoup(driver.page_source,'html.parser' )
        
        #find HTML section that contains the text of the article
        article_contents =  page_soup.select('body > div.layout__content-wrapper.layout-with-rail__content-wrapper > section.layout__wrapper.layout-with-rail__wrapper > section.layout__main-wrapper.layout-with-rail__main-wrapper > section.layout__main.layout-with-rail__main > article > section > main > div.article__content-container > div.article__content > p')

        #take all the <p> of the article sections and join them together
        article_text = ' '.join([x.text.strip() for x in article_contents]) 

        #Need to grab the date from the article
        #find the line in the page HTML that has the date
        date_line = page_soup.select('div.timestamp')[0].text.strip()
        
        #create regex object to rip out the date
        date_re = re.compile(r'\w{3,}\s\d{1,2},\s\d{4}')
        #find the date from the pattern
        date = date_re.findall(date_line)[0]
        #turn into datetime object
        date = datetime.strptime(date, '%B %d, %Y')
        #return the date as a string in the format M/D/YYYY (no zero padding)
        current_date = f"{date.month}/{date.day}/{date.year}"

        #append text and date together, only after both were parsed, so the holder lists stay the same length
        cnn_text_of_articles.append(article_text)
        cnn_date.append(current_date)

    except Exception:
        #keep one entry per URL so zip() below still pairs each title and URL with its own text and date
        #rows holding None can be dropped later if they are not wanted
        cnn_text_of_articles.append(None)
        cnn_date.append(None)
        continue

#quit() ends the whole browser session, not just the current window
driver.quit()

#create dataframe with all of our filled lists
cnn_df = pd.DataFrame(list(zip(cnn_titles, cnn_date, cnn_urls, cnn_text_of_articles)), columns = ['title','date','url', 'article_text'])

#create new column in dataframe with the source of the articles
cnn_df['source'] = 'CNN'

#write the dataframe to a CSV
cnn_df.to_csv('news_articles.csv', index = False, header = False, mode= 'a')


#### Collection of Fox News Articles
Fortunately, Fox News does not render its content the same way CNN does.  Therefore, a standard webscrape with the requests package can be used to grab the HTML.  Webdriving is unnecessary for grabbing Fox News data, so I can simply make the GET request and then parse the HTML response with Beautiful Soup.

After grabbing all article information, I turn it into a Pandas Dataframe and then write to the same CSV I am adding all the CNN articles to. 
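
The cell below builds each article URL by concatenating the base URL with the link's `href`.  That works as long as every `href` is a relative path.  If some links were already absolute (an assumption on my part; I have not checked this), plain concatenation would produce a broken URL.  A minimal sketch of a more forgiving approach uses `urljoin` from the standard library.  The helper name `build_article_url` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

fox_base_url = 'https://www.foxnews.com'


def build_article_url(href, base_url=fox_base_url):
    """Return an absolute URL for href, whether it is relative or already absolute."""
    return urljoin(base_url, href)


# relative link: the base URL is prepended
print(build_article_url('/politics/example-story'))
# absolute link (made-up example): returned unchanged
print(build_article_url('https://www.foxbusiness.com/politics/example-story'))
```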

In [8]:
#grab base FOX URLs
fox_base_url = 'https://www.foxnews.com'
fox_politics_url = 'https://www.foxnews.com/politics'

#make GET request to the Fox News politics page and grab the HTML
resp = requests.get(fox_politics_url).text

#turn HTML into Beautiful Soup Object
fox_soup = BeautifulSoup(resp, 'html.parser')

#set up blank holder lists
fox_urls = []
fox_titles = []
fox_text_of_articles = []
fox_date = []

#loop across all articles on the politics home page
for article in fox_soup.select('main.main-content .content .article'):
    
    #a few of the elements in this particular CSS selector cause errors, so errors will be skipped with this try/ except
    try:
        #some of the links are video "articles" and I don't want to scrape those pages; there is very little information
        if 'VIDEO' in article.text:
            continue
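        #Take the URL extension, concatenate it with the base URL, and then add to the holder list
        fox_urls.append(fox_base_url + article.find('a')['href'])

    #find('a') returns None when there is no link (TypeError), and a link may lack an href (KeyError)
    except (TypeError, KeyError):
        continue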

if len(fox_urls) == 0:
    raise ValueError('Need to recheck the Fox code, not pulling URLs')

#Loop across all URLs on the politics page to grab their article information
for url in fox_urls:
    
    
    #GET request of each page, grab the HTML text, and turn into BS object
    html = requests.get(url).text
    soup = BeautifulSoup(html,'html.parser')

    #Grab article title and append to holder list
    current_title = soup.select("h1.headline")[0].text
    fox_titles.append(current_title)

    #find the article section <p>s
    article_text_sections = soup.select('#wrapper > div.page-content > div.row.full > main > article > div > div.article-content > div > p')

    #grab all the paragraph element texts and join them together.    
    current_article = ' '.join([p.text for p in article_text_sections])
    #append the current article to the holder list
    fox_text_of_articles.append(current_article)

    #find the html section with the date
    date_line = soup.select('#wrapper > div.page-content > div.row.full > main > article > header > div.article-meta.article-meta-upper > div.article-date > time')[0].text
    #create regex element to rip the date out of that line
    date_re = re.compile(r'\w{3,}\s\d{1,2},\s\d{4}')
    #find the date with the regex pattern
    date = date_re.findall(date_line)[0]
    #turn found date pattern into datetime element
    date = datetime.strptime(date, '%B %d, %Y')
    #turn date element into string in format "M/D/YYYY" (no zero padding)
    current_date = f"{date.month}/{date.day}/{date.year}"
    #append date string to holder list
    fox_date.append(current_date)

#Create fox news dataframe to organize all collected information
fox_df = pd.DataFrame(list(zip(fox_titles, fox_date, fox_urls, fox_text_of_articles)), columns = ['title','date','url', 'article_text'])

#create new column with the source of these articles
fox_df['source'] = 'Fox'

#append the dataframe to the news article CSV.
fox_df.to_csv('news_articles.csv', index = False, header = False, mode= 'a')
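
Both collection loops above repeat the same date steps: find a "Month D, YYYY" pattern, parse it, and format it as a string.  A minimal sketch of pulling that into one helper is shown here.  The name `extract_date` is hypothetical, and the fallback to abbreviated month names (`%b`) is my own addition in case a timestamp ever reads "Dec" instead of "December"; the original code only handles full month names.

```python
import re
from datetime import datetime

#same pattern the loops use: a word of 3+ characters, a day, a comma, and a 4-digit year
DATE_RE = re.compile(r'\w{3,}\s\d{1,2},\s\d{4}')


def extract_date(text):
    """Return the first 'Month D, YYYY' date in text as 'M/D/YYYY', or None if none is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    #try the full month name first, then the abbreviated one
    for fmt in ('%B %d, %Y', '%b %d, %Y'):
        try:
            parsed = datetime.strptime(match.group(0), fmt)
        except ValueError:
            continue
        return f"{parsed.month}/{parsed.day}/{parsed.year}"
    return None


#made-up timestamp text for illustration
print(extract_date('Updated 3:45 PM EST, Wed December 7, 2022'))  # 12/7/2022
```

Returning `None` instead of raising lets the calling loop decide whether to skip the article or keep it with a missing date.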