# Data Scraping
* In this notebook we'll scrape news articles from `Fox` and `CNN` for further analysis.
* To keep the focus on functionality, we'll limit the scope to the `politics` section of each website.
* Running this notebook should create at least 2 `CSV` files, with data from both news sources.

## Dependency Installations

In [12]:
! pip install beautifulsoup4
! pip install fake_useragent


## Imports

In [2]:
import requests
import random
import time
import csv 
import json

import pandas as pd

from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


## Common Utilities

### Setting User-Agent

In [14]:
# setting up the user agent
ua = UserAgent()

## setting up the headers
headers = {'User-Agent': ua.chrome}

### Setting Request Session

In [15]:
## Not sure if this is best practice, but we set up one global request session to add retries and a backoff factor.
# create a session object
session = requests.Session()

retry = Retry(connect=3, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

### Helper Functions

In [16]:
## helper function to scrape metadata for foxnews
def scrape_fox_meta_data(num_pages):
    scraped_metadata = []
    url = "https://www.foxnews.com/api/article-search?searchBy=categories&values=fox-news%2Fpolitics&size={size}&from={fromParam}&mediaTags=primary_politics"
    for i in range(1, num_pages + 1):
        print("Scraping page: ", i)
        # send the same headers and timeout that are used for article pages
        data = session.get(url.format(size=10, fromParam=i), headers=headers, timeout=10).json()
        # keep every entry except the video-only ones
        scraped_metadata.extend(item for item in data if item["category"]["name"] != "VIDEO")
        # random pause between requests to avoid getting blocked
        time.sleep(random.randint(1, 3))
    return scraped_metadata


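## helper function to write data to a csv file
def write_to_csv(filename, data):
    # newline="" is what the csv module expects; utf-8 keeps non-ASCII text intact
    with open(filename, "w", newline="", encoding="utf-8") as f:
        # the keys of the first record become the column names
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)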

# Let's define scraper functions to get the text from an article

## helper function to get html content for the url
def lovely_soup(url):
    ## the timeout keeps a stalled request from hanging forever
    r = session.get(url, headers=headers, timeout=10)
    return BeautifulSoup(r.text, 'html.parser')

## helper function to get the text from the article
def get_fox_article_text(url):
    ## try/except block so that one failing article doesn't stop the whole run
    try:
        print("getting content for url: ", url)
        ## random sleep to avoid getting blocked
        time.sleep(random.randint(1, 5))
        soup = lovely_soup(url)
        ## assuming the first ld+json script holds the article json
        article_json = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[0].text)
        ## article content is stored under the `articleBody` key
        return article_json["articleBody"]
    except Exception as e:
        print("Error while getting the content for url: ", url)
        print(e)
        return ""


## Scraping Fox News

### Notes
* After exploring the `foxnews.com` website, we found that a `GET` call to the following URL returns a list of news article `metadata`, including each article's URL.

```
https://www.foxnews.com/api/article-search?searchBy=categories&values=fox-news%2Fpolitics&size=11&from=1&mediaTags=primary_politics
```
* In the URL above, the request parameter `size` determines the number of results returned and `from` determines the starting page number.
* From the returned `metadata` we filter out the entries whose category is `VIDEO` and build a dataset from the rest.
* We then loop through the `metadata`, issue a `GET` request to each article URL, and record the scraped article content.
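
As a small illustration of the URL described above, the query string can be assembled from its parameters instead of being hand-written. This is only a sketch: the helper name `build_search_url` and its arguments are hypothetical, and the parameter values are copied from the example URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.foxnews.com/api/article-search"

def build_search_url(size, start):
    """Build the article-search URL (hypothetical helper, not part of the notebook)."""
    params = {
        "searchBy": "categories",
        "values": "fox-news/politics",  # urlencode escapes "/" as %2F
        "size": size,                    # number of results per call
        "from": start,                   # starting position, as described in the notes
        "mediaTags": "primary_politics",
    }
    return f"{BASE_URL}?{urlencode(params)}"

example_url = build_search_url(size=11, start=1)
print(example_url)
```

Building the URL this way keeps the escaping of `fox-news/politics` in one place and makes it easy to vary `size` and `from` in a loop.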


### Scraping Metadata

In [17]:
# uncomment the last line to scrape the metadata; leave it commented out if the data has already been scraped
# the number 9989 was selected by querying the api in the browser and checking the total number of articles
# metadata = scrape_fox_meta_data(9989)

In [18]:

## commented out because the data is already saved; running this by accident would overwrite it
#write_to_csv("fox_metadata.csv", metadata)

In [19]:
## reading the metadata from the csv file
csv_metadata = pd.read_csv("fox_metadata.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'fox_metadata.csv'

In [None]:
## quick EDA on metadata
csv_metadata.head()

Unnamed: 0,imageUrl,title,description,url,publicationDate,lastPublishedDate,category,isBreaking,isLive,duration,authors
0,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,/politics/hassan-bolduc-trade-fire-final-showd...,2022-11-02T22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}]
1,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,/politics/biden-speech,2022-11-02T19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}]
2,https://a57.foxnews.com/static.foxnews.com/fox...,NYC's Naked Cowboy makes endorsement for gov w...,New York City's Naked Cowboy endorsed Lee Zeld...,/politics/nyc-naked-cowboy-makes-endorsement-w...,2022-11-02T21:58:25-04:00,2022-11-02T21:58:25-04:00,"{'name': 'New York City', 'url': '/category/us...",False,False,,[{'name': 'Adam Sabes'}]
3,https://static.foxnews.com/foxnews.com/content...,Hassan and Bolduc trade fire in final showdown...,A bystander took a swing at Republican Senate ...,/politics/hassan-bolduc-trade-fire-final-showd...,2022-11-02T22:47:00-04:00,2022-11-02T22:47:00-04:00,"{'name': 'New Hampshire', 'url': '/category/us...",False,False,,[{'name': 'Paul Steinhauser'}]
4,https://a57.foxnews.com/static.foxnews.com/fox...,Biden suggests voting for Republicans is a thr...,President Biden said the only way to repudiate...,/politics/biden-speech,2022-11-02T19:35:10-04:00,2022-11-02T22:15:46-04:00,"{'name': 'Joe Biden', 'url': '/category/person...",False,False,,[{'name': 'Haris Alic'}]
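
The `category` and `authors` columns above come back from the CSV as strings that look like Python literals (single quotes, so `json.loads` would reject them). A minimal sketch of turning such cells back into objects with `ast.literal_eval` follows; the helper name `parse_literal` is hypothetical, and the sample strings are taken from rows shown later in this notebook.

```python
import ast

def parse_literal(cell):
    """Convert a stringified Python literal back into an object; None if it cannot be parsed."""
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # covers malformed strings and non-string cells such as NaN
        return None

category = parse_literal("{'name': 'POLITICS', 'url': '/category/politics'}")
authors = parse_literal("[{'name': 'Joe Schoffstall'}]")
author_names = [author["name"] for author in authors]

print(category["name"])
print(author_names)
```

With pandas, the same function could be passed to `Series.apply` on the `category` or `authors` column.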


In [None]:
## let's convert the date column to datetime objects
csv_metadata["publicationDate"] = pd.to_datetime(csv_metadata["publicationDate"])

In [None]:
csv_metadata["publicationDate"].min()

datetime.datetime(2021, 2, 4, 12, 33, 23, tzinfo=tzoffset(None, -18000))

In [None]:
csv_metadata.shape

(39787, 11)

##### Notes
* So we have metadata for `39787` records. We'll need to go through these records and scrape the actual news article for each one.
* The field we are interested in is `url`. It holds a relative URL, so we'll need to convert it into an absolute URL before fetching the article.
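
The `head()` output above shows the same article at index 0 and 3 (and again at 1 and 4), so the record count probably includes repeats. A minimal sketch of dropping repeats by `url` while keeping the first occurrence is shown below; the function name and the shortened sample records are hypothetical.

```python
def dedupe_by_url(records):
    """Keep the first record seen for each url, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            unique.append(record)
    return unique

# hypothetical sample that mimics the repeating pattern in the head() output
sample_records = [
    {"url": "/politics/a", "title": "A"},
    {"url": "/politics/b", "title": "B"},
    {"url": "/politics/c", "title": "C"},
    {"url": "/politics/a", "title": "A"},
    {"url": "/politics/b", "title": "B"},
]
unique_records = dedupe_by_url(sample_records)
print(len(unique_records))
```

On the DataFrame itself, `csv_metadata.drop_duplicates(subset="url")` would have the same effect.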

In [None]:
## first step: convert relative URLs to absolute URLs
csv_metadata["url"] = csv_metadata["url"].apply(lambda url: "https://www.foxnews.com" + url)
csv_metadata.head()["url"]

0    https://www.foxnews.com/politics/hassan-bolduc...
1        https://www.foxnews.com/politics/biden-speech
2    https://www.foxnews.com/politics/nyc-naked-cow...
3    https://www.foxnews.com/politics/hassan-bolduc...
4        https://www.foxnews.com/politics/biden-speech
Name: url, dtype: object
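
String concatenation works here because every `url` value is relative. As an alternative sketch, `urllib.parse.urljoin` gives the same result and also leaves an already absolute URL unchanged, which makes the cell safe to re-run. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

BASE = "https://www.foxnews.com"

def to_absolute(url):
    """Resolve a relative article path against the site root; absolute URLs pass through."""
    return urljoin(BASE, url)

first_pass = to_absolute("/politics/biden-speech")
second_pass = to_absolute(first_pass)  # applying it twice does not double the prefix
print(first_pass)
```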

##### Notes
* After some manual investigation we found that the actual news content is embedded in the page as a `json` object.
* This makes our life a bit easier: we don't have to scrape through multiple `html` tags, we just query a single `script` tag.
* To help with that we've written a helper function, `get_fox_article_text`, that returns the article text for a given URL.
* Now we'll loop through our dataset and fill in the article content.
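
To make the idea concrete without hitting the network, here is a minimal sketch that pulls `articleBody` out of an `application/ld+json` script tag. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs on its own, and, unlike the helper above, it checks every ld+json block rather than only the first. The class name, function name, and `sample_html` are all hypothetical.

```python
import json
from html.parser import HTMLParser

class LdJsonCollector(HTMLParser):
    """Collect the raw contents of every <script type="application/ld+json"> tag."""
    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False

def extract_article_body(html):
    """Return the first articleBody found in any ld+json block, or None."""
    parser = LdJsonCollector()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if isinstance(item, dict) and "articleBody" in item:
                return item["articleBody"]
    return None

# made-up page with two ld+json blocks; only the second holds an article
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Example"}</script>
<script type="application/ld+json">{"@type": "NewsArticle", "articleBody": "Hello world."}</script>
</head><body></body></html>
"""
body = extract_article_body(sample_html)
print(body)
```

Returning `None` when nothing is found keeps "no article body" distinguishable from an article whose body is an empty string.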

In [None]:
## let's get the text for the articles
csv_metadata["text"] = csv_metadata["url"].apply(get_fox_article_text)

getting content for url:  https://www.foxnews.com/politics/hassan-bolduc-trade-fire-final-showdown-after-gop-nominee-comes-under-attack-arriving-debate
getting content for url:  https://www.foxnews.com/politics/biden-speech
getting content for url:  https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order
getting content for url:  https://www.foxnews.com/politics/hassan-bolduc-trade-fire-final-showdown-after-gop-nominee-comes-under-attack-arriving-debate
getting content for url:  https://www.foxnews.com/politics/biden-speech
getting content for url:  https://www.foxnews.com/politics/nyc-naked-cowboy-makes-endorsement-while-performing-times-square-restore-law-order
getting content for url:  https://www.foxnews.com/politics/hassan-bolduc-trade-fire-final-showdown-after-gop-nominee-comes-under-attack-arriving-debate
getting content for url:  https://www.foxnews.com/politics/biden-speech
getting content for url:  https://www.foxnew

In [None]:
csv_metadata.loc[39785,"text"]

'FIRST ON FOX: New documents exclusively obtained by Fox News Digital reveal that the U.S. Army is teaching West Point cadets critical race theory (CRT), including addressing "whiteness." Fox News Digital exclusively obtained the documents from government watchdog group Judicial Watch, whichhad to sue the military twice under the Freedom of Information Act (FOIA) to get the information. "Our military is under attack – from within," Judicial Watch president Tom Fitton said in the press release. "These documents show racist, anti-American CRT propaganda is being used to try to radicalize our rising generation of Army leadership at West Point." MICHIGAN DAD, A MARINE VET, SAYS CRITICAL RACE THEORY CONTRARY TO VALUES LEARNED IN MILITARY  Fitton told Fox News Digital that the material was obtained as part of a request for documents related to the instruction of cadets.&nbsp; Judicial Watch received over 600 pages of documents from the two lawsuits that were levied after the Department of De

In [None]:
csv_metadata.to_csv("fox_data.csv", index=False)

NameError: name 'csv_metadata' is not defined

In [3]:
news_data = pd.read_csv("./data/fox_data.csv")

In [4]:
news_data.shape

(39787, 12)

## Check for missing data


In [24]:
news_data.isnull().mean()

imageUrl             0.000000
title                0.000000
description          0.000503
url                  0.000000
publicationDate      0.000000
lastPublishedDate    0.000000
category             0.000000
isBreaking           0.000000
isLive               0.000000
duration             1.000000
authors              0.000000
text                 0.000804
dtype: float64
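
`isnull().mean()` reports the fraction of missing values per column. Multiplying by the row count from `news_data.shape` gives approximate counts, which are easier to reason about. The sketch below does the arithmetic with the numbers printed above; because the fractions are rounded to six decimals, the counts are estimates.

```python
total_rows = 39787  # from news_data.shape

# fractions copied from the isnull().mean() output above
missing_fraction = {"description": 0.000503, "text": 0.000804}

# convert each fraction into an approximate number of rows
missing_counts = {
    column: round(fraction * total_rows)
    for column, fraction in missing_fraction.items()
}
print(missing_counts)
```

`news_data.isnull().sum()` returns the exact counts directly.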

In [30]:
news_data[news_data["text"].isnull()]

Unnamed: 0,imageUrl,title,description,url,publicationDate,lastPublishedDate,category,isBreaking,isLive,duration,authors,text
367,https://a57.foxnews.com/static.foxnews.com/fox...,George Soros spends big in last-minute attempt...,Billionaire George Soros has poured more than ...,https://www.foxnews.com/politics/george-soros-...,2022-11-02 12:17:12-04:00,2022-11-02T12:17:12-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[{'name': 'Joe Schoffstall'}],
374,https://a57.foxnews.com/static.foxnews.com/fox...,George Soros spends big in last-minute attempt...,Billionaire George Soros has poured more than ...,https://www.foxnews.com/politics/george-soros-...,2022-11-02 12:17:12-04:00,2022-11-02T12:17:12-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[{'name': 'Joe Schoffstall'}],
11569,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11578,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11587,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11596,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11604,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11612,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11620,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],
11627,https://a57.foxnews.com/static.foxnews.com/fox...,Political cartoons of the day,,https://www.foxnews.com/politics/cartoons-slid...,2021-02-04 12:33:23-05:00,2022-09-29T13:30:57-04:00,"{'name': 'POLITICS', 'url': '/category/politics'}",False,False,,[],


## Handling Missing Data

In [8]:
news_data[news_data["text"].isnull()]["url"].unique()
# news_data.loc[news_data["text"].isnull(), "text"] = news_data[news_data["text"].isnull()]["url"].apply(get_fox_article_text)

array(['https://www.foxnews.com/politics/cartoons-slideshow',
       'https://www.foxnews.com/politics/photos-president-trump-family-gather-funeral-ivana-trump',
       'https://www.foxnews.com/politics/supreme-court-overturns-roe-v-wade-photos-protesters-crowds-outside'],
      dtype=object)
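
All three URLs with missing text contain `photos` or `slideshow` in their final path segment, which suggests they are image galleries without an article body. A rough sketch for flagging such pages is below. The token list is an assumption drawn only from these three URLs, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# assumption: these tokens mark gallery pages; based only on the three URLs above
GALLERY_TOKENS = {"photos", "slideshow"}

def looks_like_gallery(url):
    """Return True if the last path segment contains a gallery token."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return any(token in GALLERY_TOKENS for token in slug.split("-"))

missing_text_urls = [
    "https://www.foxnews.com/politics/cartoons-slideshow",
    "https://www.foxnews.com/politics/photos-president-trump-family-gather-funeral-ivana-trump",
    "https://www.foxnews.com/politics/supreme-court-overturns-roe-v-wade-photos-protesters-crowds-outside",
]
flags = [looks_like_gallery(url) for url in missing_text_urls]
print(flags)
```

Rows flagged this way could be dropped or given an empty `text` value before saving, depending on what the later analysis needs.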

In [37]:
news_data.isnull().mean()

imageUrl             0.000000
title                0.000000
description          0.000503
url                  0.000000
publicationDate      0.000000
lastPublishedDate    0.000000
category             0.000000
isBreaking           0.000000
isLive               0.000000
duration             1.000000
authors              0.000000
text                 0.000000
dtype: float64

In [36]:
news_data.to_csv("./data/fox_data.csv", index=False)