# This notebook scrapes the text content of news articles with BeautifulSoup

Word-based approaches, often used in information retrieval settings, are good candidates in terms of system performance, but they face challenges such as coping with synonyms and orthographical variants and defining “queries” from users’ historical activities.
 
 
## Data source

- Conservative

    - http://www.foxnews.com/

    - https://www.breitbart.com/



- Liberal

    - http://www.cnn.com/

    - https://www.nytimes.com/




### Method

I use [News API](https://newsapi.org) to search news articles and headlines from across the web in near real time. For each media outlet, I collect the article URLs it returns and then extract the text content from each URL.

The free plan allows only 500 requests per day, so I query multiple keywords to build up the collection of news.


The output of the web scraping is loaded into MongoDB, so each document in the collection should have the following structure:


***Data structure in collection***

```python
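[{'query': ...,         # search keyword used for the request
  'media_domain': ...,  # e.g. 'foxnews.com'
  'url': ...,           # link to the article
  'title': ...,         # headline returned by News API
  'article': [...]},    # list of paragraph strings scraped from the URL
 ...]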
```
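
The notebook does not show the MongoDB loading step itself. As a minimal sketch under stated assumptions, the records could be filtered and inserted as below. The helper names `prepare_documents` and `load_records` are hypothetical, the field names follow the dictionaries built in `get_news_url`, and `collection` is assumed to be any object exposing `insert_many`, such as a pymongo `Collection`.

```python
def prepare_documents(records):
    """Keep only records that carry every expected field (shallow copies)."""
    required = ('query', 'media_domain', 'url', 'title', 'article')
    return [dict(r) for r in records if all(k in r for k in required)]


def load_records(collection, records):
    """Insert the prepared records and return how many were sent."""
    docs = prepare_documents(records)
    if docs:  # skip the call when there is nothing to insert
        collection.insert_many(docs)
    return len(docs)
```

With pymongo installed and a server running, `collection` might be something like `MongoClient()['news']['articles']`, where the database and collection names are placeholders.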

# Helper function

In [1]:
import requests
from bs4 import BeautifulSoup
import urllib.request
import json

import datetime 
import dateutil.relativedelta

In [2]:
now = datetime.datetime.now()

# define start date and end date (within a month) to collect news
start_date = now + dateutil.relativedelta.relativedelta(months=-1)
end_date = now

start_date = start_date.strftime("%Y-%m-%d")
end_date = end_date.strftime("%Y-%m-%d")

print(start_date, end_date)

2019-07-19 2019-08-19


In [3]:
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key='YOUR_API_KEY')

def get_news_url(query, media_name, media_domain, start_day, end_day, page=2):
    
    result = []
    all_articles = newsapi.get_everything(q = query,
                                      sources = media_name,
                                      domains = media_domain,
                                      from_param = start_day,
                                      to = end_day,
                                      language='en',
                                      sort_by='relevancy',
                                      page=page)
    
    for article in all_articles['articles']:
        result.append({'query' : query, 
                       'media_domain' : media_domain, 
                       'title' : article['title'], 
                       'url' : article['url'],
                       'article' : []})
        
    return result
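
`get_news_url` requests a single page per call (page 2 by default). If more results per keyword are wanted, one possible approach is to walk consecutive pages until one comes back empty. This is only a sketch: `collect_pages` and `fetch_page` are hypothetical names, and the empty-page stopping rule is an assumption, since the notebook does not show how the API behaves past the last page. Each page costs one request from the daily quota.

```python
def collect_pages(fetch_page, first_page=1, max_pages=3):
    """Call fetch_page(page) for consecutive pages until one comes back empty."""
    collected = []
    for page in range(first_page, first_page + max_pages):
        batch = fetch_page(page)
        if not batch:  # assumed signal that there are no more results
            break
        collected.extend(batch)
    return collected
```

A call could look like `collect_pages(lambda p: get_news_url('tax', 'cnn', 'cnn.com', start_date, end_date, page=p))`.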

In [4]:
techcrunch_venmo = get_news_url(None,'techcrunch', 'techcrunch.com',start_date, end_date)
techcrunch_venmo

[{'query': None,
  'media_domain': 'techcrunch.com',
  'title': 'TikTok tests an Instagram-style grid and other changes',
  'url': 'http://techcrunch.com/2019/07/19/tiktok-tests-an-instagram-style-grid-and-other-changes/',
  'article': []},
 {'query': None,
  'media_domain': 'techcrunch.com',
  'title': 'Lyft’s dockless e-bikes have made their way to SF, but it wasn’t easy',
  'url': 'http://techcrunch.com/2019/07/21/lyft-e-bikes-san-francisco/',
  'article': []},
 {'query': None,
  'media_domain': 'techcrunch.com',
  'title': 'クックパッドがスマートキッチン業界カオスマップ2019上半期版を公開',
  'url': 'https://jp.techcrunch.com/2019/07/19/food-tech-chaos-map/',
  'article': []},
 {'query': None,
  'media_domain': 'techcrunch.com',
  'title': 'India’s Oyo valued at $10B after founder purchases $2B in shares',
  'url': 'http://techcrunch.com/2019/07/19/indias-oyo-valued-at-10b-after-founder-purchases-2b-in-shares/',
  'article': []},
 {'query': None,
  'media_domain': 'techcrunch.com',
  'title': 'Lyft expands its P

## Extract URL 

In [5]:
def get_url(media_name, media_domain, start_date , end_date , query_list):
    
    news = []
    for query in query_list:
        news.append( get_news_url(query, media_name, media_domain, start_date , end_date) )
    return news

## List of query keywords

In [6]:
query_list = ['gun', 'white supremacist', 'mass shooting', 'jeffrey epstein', 'trump', 
               'impeachment', 'island', 'election', 'russia', 'recession', 'supreme court', 
               'federal reserve', 'gay', 'abortion', 'same-sex', 'war', 
               'artificial intelligence', 'public charge', 'immigration', 'marijuana', 
               'church', 'israel', 'tax', 'education', 'insurance', 'college', 'police', 
               'art', 'space', 'muslim', 'health care', 'jeffrey epstein', 'crime']


#query_list = [ 'church', 'israel', 'tax', 'education', 'insurance', 'college', 'police', 
#               'art', 'space', 'muslim', 'health care', 'jeffrey epstein', 'crime']


print(len(query_list))

33
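
The list above contains `'jeffrey epstein'` twice, so the 33 entries cover 32 distinct keywords and the repeated one costs an extra request per outlet. A small sketch of removing repeats while keeping the original order (the helper name `unique_in_order` is hypothetical):

```python
def unique_in_order(items):
    """Drop repeated entries while keeping each first occurrence in place."""
    return list(dict.fromkeys(items))


sample = ['gun', 'jeffrey epstein', 'trump', 'jeffrey epstein', 'crime']
print(unique_in_order(sample))
# ['gun', 'jeffrey epstein', 'trump', 'crime']
```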


# Scrape news 

## Scraping Fox news

In [12]:
def scrape_fox_news(url):
    
    text = []
    
    # get page text
    page = requests.get(url)
    # parse with BeautifulSoup
    soup = BeautifulSoup(page.text, 'html.parser')  
    try:
        body = soup.find(class_=['article-body', 'article-text']).find_all('p')
            
        for i in body:
            text.append(i.get_text())
        
        return text
    except AttributeError:
        # soup.find() returned None: no article container on this page
        return []

In [13]:
url = 'https://www.foxnews.com/us/arkansas-leaves-court-juror-flees-sentencing'
#url = 'https://www.foxnews.com/us/georgia-prison-gang-inmate-escapes-officials-say'
scrape_fox_news(url)

["Fox News Flash top headlines for August 5 are here. Check out what's clicking on Foxnews.com",
 'An Arkansas man who was convicted of attempted murder allegedly left court during a recess\xa0— in the company of a juror on his case —\xa0and later fled during his sentencing, according to prosecutors.',
 'Madriekus Blakes, 32, was convicted Tuesday of two counts of attempted first-degree murder and four counts of a terroristic act for a September 2018 shooting at a convenience store in West Memphis, a city roughly 120 miles northeast of Little Rock.',
 '1998 ARKANSAS SCHOOL SHOOTER KILLED IN CAR CRASH 20 YEARS LATER',
 '\n      Madriekus Blakes, 32, left his trial with a juror on his case, and fled during his sentencing the next day, prosecutors said.\n      (Second Judicial District Prosecuting Attorney)',
 'Blakes, according to police, had been fighting with a man he claimed owed him money and fired four shots at the man and his father while they were sitting in a truck outside the st

In [14]:
# Get URL from FOX
fox_news = get_url('fox-news', 'foxnews.com', start_date, end_date, query_list)


# Flatten list of list
fox_news = sum(fox_news, [])

# Extract text content
for i in fox_news:
    #print(i['url'], i['article'])
    if 'video.foxnews' not in i['url']:
        i['article'] = scrape_fox_news(i['url'])
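
`sum(list_of_lists, [])` works for flattening but builds a new list at every step. As an alternative sketch, `itertools.chain.from_iterable` flattens in a single pass, and a small predicate keeps the video check readable. The helper names `flatten` and `is_video_url` are hypothetical.

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate the per-query result lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))


def is_video_url(url, marker='video.foxnews'):
    """True for links that point at the video site rather than an article."""
    return marker in url
```

With these, the cell above could read `fox_news = flatten(fox_news)` and then scrape only the records where `is_video_url(i['url'])` is false.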

In [15]:
# check which url does not contain text content
c = 0
for i in fox_news:
    if len(i['article']) == 0:
        print(i['url'])
        c +=1
c 

http://video.foxnews.com/v/6070874085001/
http://video.foxnews.com/v/6068370715001/
https://www.foxnews.com/us/georgia-prison-gang-inmate-escapes-officials-say
https://www.foxnews.com/us/el-paso-mass-shooting-walmart-20-dead-suspect-in-custody
https://www.foxnews.com/media/buzz-aldrin-predicts-decades-of-trumps-artemis-program-similar-to-apollo
https://www.foxnews.com/media/michael-collins-apollo-11-pilot-disagrees-with-return-to-moon-wants-straight-shot-to-mars
http://video.foxnews.com/v/6073269667001/
http://video.foxnews.com/v/6065942565001/
http://video.foxnews.com/v/6065358862001/
https://www.foxnews.com/politics/howie-kurtz-muellers-testimony-debacle-impeachment-dems
https://radio.foxnews.com/2019/07/26/media-buzzmeter-07-26-2019/
https://radio.foxnews.com/2019/07/25/media-buzzmeter-07-25-2019/
https://www.foxnews.com/world/british-scientist-greek-island-search-geolaction-phone
http://video.foxnews.com/v/6062835853001/
https://radio.foxnews.com/2019/07/22/fnc-senior-political-ana

40
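
The URLs printed above come from several hosts (`video.foxnews.com`, `radio.foxnews.com`, `www.foxnews.com`). To see where the scraper misses most often, the empty records can be tallied by host. A minimal sketch, with `empty_by_host` as a hypothetical helper name:

```python
from collections import Counter
from urllib.parse import urlparse


def empty_by_host(records):
    """Count records with no scraped text, grouped by the URL's host name."""
    return Counter(urlparse(r['url']).netloc
                   for r in records if not r['article'])
```

Calling `empty_by_host(fox_news)` would return a `Counter` mapping each host to its number of empty articles.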

In [17]:
# save collection into json
with open('fox_news_'+ str(end_date) +'.json', 'w') as f:
    json.dump(fox_news , f)
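
By default `json.dump` escapes non-ASCII characters as `\uXXXX`, which makes titles such as the Japanese TechCrunch headline in the earlier sample output hard to read in the saved file. A hedged sketch of writing and reading the collection as UTF-8 instead (the helper names `save_records` and `read_records` are hypothetical):

```python
import json


def save_records(records, path):
    """Write records as UTF-8 JSON without escaping non-ASCII characters."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def read_records(path):
    """Read records back from a file written by save_records."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```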

## Scraping CNN news

In [18]:
def scrape_cnn_news(url):
    
    text = []
    # get page text
    page = requests.get(url)
    # parse with BeautifulSoup
    soup = BeautifulSoup(page.text, 'html.parser')  
    
    body = soup.find_all(class_= ['zn-body__paragraph speakable', 
                                  'zn-body__paragraph', 
                                  'Paragraph__component', 
                                  'Text-sc-1amvtpj-0 render-stellar-contentstyles__List-sc-9v7nwy-1 hWPJAy',
                                  'Text-sc-1amvtpj-0-p render-stellar-contentstyles__Paragraph-sc-9v7nwy-2 fAchMW'
])
    for i in body:
        text.append(i.get_text())
        
    return text

In [19]:
url = 'https://www.cnn.com/2019/08/14/economy/recession-risk-economies/index.html'
scrape_cnn_news(url)

["London (CNN Business)Five big economies are at risk of recession. It won't take much to push them over the edge. ",
 "The British economy shrunk in the second quarter, and growth flat lined in Italy. Data published Wednesday show Germany's economy, the world's fourth largest, contracted in the three months to June. ",
 '"The bottom line is that the German economy is teetering on the edge of recession," said Andrew Kenningham, chief Europe economist at Capital Economics.',
 'Mexico just dodged a recession— usually defined as two consecutive quarters of contraction — and its economy is expected to remain weak this year. And data suggest that Brazil slipped into recession in the second quarter.',
 "Germany, Britain, Italy, Brazil and Mexico each rank among the world's largest 20 economies. Singapore and Hong Kong, which are smaller but still serve as vital hubs for finance and trade, are also suffering. ",
 'While growth has been dragged lower in each country by a specific cocktail of f

In [20]:
# Get URL from CNN
cnn_news = get_url('cnn', 'cnn.com', start_date, end_date, query_list)


# Flatten list of list
cnn_news = sum(cnn_news, [])

# Extract text content
for i in cnn_news:
    i['article'] = scrape_cnn_news(i['url'])

In [21]:
cnn_news[10]

{'query': 'gun',
 'media_domain': 'cnn.com',
 'title': "Murdoch's New York Post urges Trump to ban assault weapons",
 'url': 'https://www.cnn.com/2019/08/05/media/new-york-post-assault-weapons/index.html',
 'article': ["New York (CNN Business)One of President Trump's favorite newspapers, the New York Post, is delivering him a message.",
  '"President Trump, America is scared and we need bold action," Monday\'s front page reads. "It\'s time to... BAN WEAPONS OF WAR."',
  'The blunt cover comes after twin mass shootings in El Paso, Texas, and Dayton, Ohio, over the weekend.',
  'An editorial inside the tabloid calls for "the return of an assault-weapons ban."',
  'The New York Post is controlled by Rupert Murdoch, the right-wing media mogul who keeps in close touch with the president.',
  'This is not the first time the Post has called for such a ban. But it is still striking to see the message on the front page of the paper, where Trump is likely to see it. Trump grew up reading the pap

In [22]:
len(cnn_news)

595

In [23]:
# check which url does not contain text content
c = 0
for i in cnn_news:
    if len(i['article']) == 0:
        print(i['url'])
        c +=1
c   

https://www.cnn.com/videos/politics/2019/08/05/president-donald-trump-mass-shootings-bipartisan-action-remarks-sot-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/reality-check-dems-calling-mcconnell-bring-back-senate-gun-reform-avlon-newday-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/trump-background-checks-erin-burnett-monologue-ebof-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/15/el-paso-mayor-rino-donald-trump-called-sot-vpx-nr.cnn
https://www.cnn.com/videos/politics/2019/08/12/john-legend-dayton-ohio-shooting-concert-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/05/el-paso-shooting-vigil-beto-orourke-sot-earlystart-vpx.cnn
https://www.cnn.com/videos/us/2019/07/29/california-gilroy-garlic-festival-shooting-witness-gunpowder-christian-swain-bpr-newday-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/donald-melania-trump-el-paso-orphaned-baby-photo-newday-sot.cnn
https://www.cnn.com/videos/entertainment/2019/08/07/comedy-central-airs-gun-violence-

54

In [24]:
for i in cnn_news:
    if len(i['article']) == 0:
        i['article'] = scrape_cnn_news(i['url'])

In [25]:
c = 0
for i in cnn_news:
    if len(i['article']) == 0:
        print(i['url'])
        c +=1
c   

https://www.cnn.com/videos/politics/2019/08/05/president-donald-trump-mass-shootings-bipartisan-action-remarks-sot-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/reality-check-dems-calling-mcconnell-bring-back-senate-gun-reform-avlon-newday-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/trump-background-checks-erin-burnett-monologue-ebof-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/15/el-paso-mayor-rino-donald-trump-called-sot-vpx-nr.cnn
https://www.cnn.com/videos/politics/2019/08/12/john-legend-dayton-ohio-shooting-concert-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/05/el-paso-shooting-vigil-beto-orourke-sot-earlystart-vpx.cnn
https://www.cnn.com/videos/us/2019/07/29/california-gilroy-garlic-festival-shooting-witness-gunpowder-christian-swain-bpr-newday-vpx.cnn
https://www.cnn.com/videos/politics/2019/08/09/donald-melania-trump-el-paso-orphaned-baby-photo-newday-sot.cnn
https://www.cnn.com/videos/entertainment/2019/08/07/comedy-central-airs-gun-violence-

54
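
The second pass leaves the count at 54, and every URL shown in the output sits under `/videos/`, so retrying those pages is unlikely to help. One option is to separate such records before retrying. This is a sketch only, and `split_by_path_prefix` is a hypothetical helper name:

```python
from urllib.parse import urlparse


def split_by_path_prefix(records, prefix='/videos/'):
    """Return (matching, others) according to the prefix of each URL's path."""
    matching, others = [], []
    for r in records:
        path = urlparse(r['url']).path
        target = matching if path.startswith(prefix) else others
        target.append(r)
    return matching, others
```

For example, `videos, articles = split_by_path_prefix(cnn_news)` would leave only non-video records in `articles` for a retry.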

In [26]:
# save the collection into json
with open('cnn_news_'+ str(end_date) +'.json', 'w') as f:
    json.dump(cnn_news , f)

In [27]:
len(fox_news), len(cnn_news)

(625, 595)

## Scraping NYTimes news

In [28]:
def scrape_nytimes_news(url):
    
    text = []
    # get page text
    page = requests.get(url)
    # parse with BeautifulSoup
    soup = BeautifulSoup(page.text, 'html.parser')  
    
    body = soup.find_all('p')
    
    for i in body:
        text.append(i.get_text())
        
    return text

In [30]:
# Get URL from NYTimes
nytimes_news = get_url('the-new-york-times', 'nytimes.com', start_date, end_date, query_list)

# Flatten list of list
nytimes_news = sum(nytimes_news, [])

# Extract text content
for i in nytimes_news :
    i['article'] = scrape_nytimes_news(i['url'])

In [32]:
# check which url does not contain text content
c = 0
for i in nytimes_news:
    if len(i['article']) == 0:
        print(i['url'])
        c +=1
c  

https://www.nytimes.com/aponline/2019/07/30/world/asia/ap-as-india-muslim-divorce-law.html
https://www.nytimes.com/aponline/2019/07/28/business/ap-us-election-2020-democrats-fact-check.html
https://www.nytimes.com/reuters/2019/08/02/technology/02reuters-at-s-austria-results.html
https://www.nytimes.com/aponline/2019/07/26/world/asia/ap-as-philippines-earthquake.html
https://www.nytimes.com/aponline/2019/07/30/world/asia/ap-as-india-muslim-divorce-law.html


5

In [31]:
len(nytimes_news)

573

In [33]:
# save the collection into json
with open('nytimes_news_'+ str(end_date) +'.json', 'w') as f:
    json.dump(nytimes_news , f)

## Scraping Breitbart news

In [34]:
def scrape_breitbart_news(url):
    
    text = []
    # get page text
    page = requests.get(url)
    # parse with BeautifulSoup
    soup = BeautifulSoup(page.text, 'html.parser')  
    
    body = soup.find_all('p')
    
    for i in body:
        text.append(i.get_text())
        
    return text

In [35]:
url = 'https://www.breitbart.com/politics/2019/08/09/washington-post-explains-why-some-el-paso-survivors-support-trump/'
scrape_breitbart_news(url)

['Breitbart News reported on National Public Radio’s (NPR) effort to find Trump haters in El Paso and instead the tax-payer funded media outlet came face to face with Tito Anchondo, who lost his brother and sister-in-law. Anchondo said his brother and his whole family are Republicans and support Trump.',
 'Now the Washington Post has\xa0profiled Anchondo in a story headlined, “Why one family mourning El Paso victims chose to meet with Trump” and embedded a video on top of the story featuring people who refused to meet with Trump in Texas.',
 '“Tito Anchondo wishes people would stop politicizing his family’s tragedy,” the Post reported before it proceeded with a political spin:',
 'Melania Trump posted a photo Thursday on Twitter showing the meeting with Tito Anchondo, his sister, Deborah Ontiveros, and the infant. In the photo, Melania holds the baby, while Trump smiles and gives a thumbs-up — an image that drew anger on social media. Some criticized the president’s facial expression a

In [37]:
# Get URL from breitbart
breitbart_news = get_url('breitbart-news', 'breitbart.com', start_date, end_date, query_list)

# Flatten list of list
breitbart_news = sum(breitbart_news, [])


# Extract text content
# print a running counter to show progress
for c, i in enumerate(breitbart_news, start=1):
    print(c)
    i['article'] = scrape_breitbart_news(i['url'])

1
2
3
...
277


In [38]:
# check which url does not contain text content
c = 0
for i in breitbart_news:
    if len(i['article']) == 0:
        print(i['url'])
        c +=1
c  

0

In [39]:
# save the collection into json
with open('breitbart_news_'+ str(end_date) +'.json', 'w') as f:
    json.dump(breitbart_news , f)

In [40]:
!ls *json

breitbart_news_2019-08-19.json merged_file.json
cnn_news_2019-08-19.json       nytimes_news_2019-08-19.json
fox_news_2019-08-19.json
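
The listing includes `merged_file.json`, but the notebook does not show how it was produced. One plausible way, stated here purely as an assumption, is to concatenate the per-outlet lists into a single file. The helper name `merge_json_lists` is hypothetical.

```python
import glob
import json


def merge_json_lists(pattern, output_path):
    """Concatenate every JSON list matching pattern and write one combined file."""
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding='utf-8') as f:
            merged.extend(json.load(f))
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(merged, f)
    return len(merged)
```

A call such as `merge_json_lists('*_news_*.json', 'merged_file.json')` uses a pattern that does not match the output name, so a previously merged file is not read back in.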
