# Web scraping different news websites for Covid-19 headlines

This notebook deals with web scraping. Data on websites is largely unstructured; web scraping collects that data and stores it in a structured form. There are several ways to scrape a website, such as online services, APIs, or writing our own code. In this notebook, we'll see how to implement web scraping in Python using the BeautifulSoup library.
 
We scrape three pages that belong to three different news organizations (CNN, NBC, CNBC) for Covid-19 related headlines. The headlines collected this way can complement an existing Covid-19 dataset and support further analysis.

In [19]:
#importing all the necessary libraries
from datetime import date
from bs4 import BeautifulSoup
import requests
import spacy
import en_core_web_sm
import pandas as pd

In [20]:
#collecting all the news website urls
cnn_url= 'https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html'
nbc_url= "https://www.nbcnews.com/health/coronavirus"
cnbc_rss_url = "https://www.cnbc.com/id/10000108/device/rss/rss.html"    

Before scraping the websites, it's important to understand their formats:

- The CNN URL has a date embedded in it (`07-07-20` above). Only the date appears to change from day to day while the rest of the URL stays the same, so the URL has to be rebuilt for each day's page. The page itself is HTML.
- For CNBC, we use the site's RSS feed, which is an XML file.
- The NBC News URL has no date in it and points to a plain HTML page.
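
To make the difference between an HTML page and an RSS feed concrete, here is a minimal sketch that reads a small, made-up RSS snippet. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs offline; the feed content and the helper name `item_descriptions` are illustrative assumptions, not CNBC's actual feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet; real feeds follow the same channel/item layout
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example health feed</title>
    <description>Feed-level description, not a headline</description>
    <item>
      <title>First example story</title>
      <description>Summary text of the first example story</description>
    </item>
    <item>
      <title>Second example story</title>
      <description>Summary text of the second example story</description>
    </item>
  </channel>
</rss>"""


def item_descriptions(rss_text):
    """Return the <description> text of every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    # iter("item") visits only the story entries, so the channel-level
    # <description> is skipped
    return [item.findtext("description", default="") for item in root.iter("item")]


descriptions = item_descriptions(SAMPLE_RSS)
print(descriptions)
```

Note that a search for every `description` tag in a feed like this would also match the feed-level description, which is why the sketch restricts itself to `item` elements.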

In [21]:
#parallel lists, in the same order: url, BeautifulSoup parser for the page format, tag that holds the headlines, and website name
urls = [cnn_url,nbc_url,cnbc_rss_url]
formats=['html.parser','html.parser','xml']
tags = ['h2','h2','description']
website = ['CNN','NBC','CNBC']
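
The four parallel lists above have to stay in the same order to remain correct. One possible alternative, sketched here as an assumption rather than as part of the notebook, is to keep one record per site. `NewsSource` and `SOURCES` are hypothetical names; the values are the ones used above.

```python
from typing import NamedTuple


class NewsSource(NamedTuple):
    """One site to scrape (hypothetical record type, not part of the notebook)."""
    name: str
    url: str
    parser: str  # parser name passed to BeautifulSoup
    tag: str     # tag that holds the headline text


SOURCES = [
    NewsSource("CNN",
               "https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html",
               "html.parser", "h2"),
    NewsSource("NBC",
               "https://www.nbcnews.com/health/coronavirus",
               "html.parser", "h2"),
    NewsSource("CNBC",
               "https://www.cnbc.com/id/10000108/device/rss/rss.html",
               "xml", "description"),
]

# Each record carries everything needed to scrape one site
for source in SOURCES:
    print(source.name, source.parser, source.tag)
```

With this layout, adding a fourth site means adding one line instead of editing four lists.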

For better understanding, we will first scrape only the CNN website:

In [22]:
#setting the date format according to the date format used in the cnn url
today = date.today()
d = today.strftime('%m-%d-%y')
print('date =',d)

date = 07-22-20
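
The date string above is formatted like the one in the CNN URL, so it could be used to build the URL for a given day. The sketch below assumes the URL pattern inferred from the single CNN URL in this notebook; CNN may use a different slug on other days, and `build_cnn_url` is a hypothetical helper.

```python
from datetime import date

# Assumed pattern, inferred from the one CNN URL used in this notebook
CNN_URL_TEMPLATE = (
    "https://www.cnn.com/world/live-news/"
    "coronavirus-pandemic-{day}-intl/index.html"
)


def build_cnn_url(day):
    """Return the live-news URL for the given date (hypothetical helper)."""
    return CNN_URL_TEMPLATE.format(day=day.strftime("%m-%d-%y"))


# Reproduces the URL stored in cnn_url for July 7, 2020
print(build_cnn_url(date(2020, 7, 7)))
```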


In [23]:
#getting the html 
html = requests.get(cnn_url).text

In [24]:
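#creating a soup object; naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.title) #printing the title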

<title data-rh="true">July 7 coronavirus news</title>


To get an idea of the context of each headline, we use the named entity recognition of the spaCy library:

In [25]:
#loading spaCy's small English model, which includes the named entity recognizer
nlp = en_core_web_sm.load()

In [26]:
#printing each CNN headline together with the named entities found in it
for link in soup.find_all('h2'): #the CNN page puts its headlines in h2 tags
    
    print("Headline : {}".format(link.text))
    for ent in nlp(link.text).ents:
        print("\tText : {}, Entry : {}".format(ent.text,ent.label_))

Headline : What you need to know
Headline : Study finds coronavirus associated with neurological complications
Headline : Colombia extends coronavirus lockdown measures
	Text : Colombia, Entry : GPE
Headline : Washington state governor blames Southern states reopening early for late Covid-19 test results
	Text : Washington, Entry : GPE
	Text : Southern, Entry : NORP
Headline : South Dakota governor says she tested negative for Covid-19 after Fourth of July event
	Text : South Dakota, Entry : GPE
	Text : Fourth of July, Entry : DATE
Headline : Texas Republicans have no plans to cancel in-person convention in Houston 
	Text : Texas, Entry : GPE
	Text : Republicans, Entry : NORP
	Text : Houston, Entry : GPE
Headline : Columbia University will welcome back 60% of undergraduate students in the fall
	Text : Columbia University, Entry : ORG
	Text : 60%, Entry : PERCENT
Headline : More than 45,000 new coronavirus cases reported in Brazil
	Text : More than 45,000, Entry : CARDINAL
	Text : Brazi

Now we collect headlines from all three websites:

In [15]:
#crawling each page with its own parser and headline tag, then printing the headlines and their named entities
for url, parser, tag in zip(urls, formats, tags):
    print("Crawling webpage ...{}".format(url))
    response = requests.get(url)
    soup = BeautifulSoup(response.content, parser)

    for link in soup.find_all(tag):

        #keeping only text with more than four space-separated pieces, to skip short labels
        if len(link.text.split(" ")) > 4:
            print("Headline : {}".format(link.text))

            for ent in nlp(link.text).ents:
                print("\tText : {}, Entity : {}".format(ent.text, ent.label_))

Crawling webpage ...https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html
Headline : What you need to know
Headline : Study finds coronavirus associated with neurological complications
Headline : Colombia extends coronavirus lockdown measures
	Text : Colombia, Entity : GPE
Headline : Washington state governor blames Southern states reopening early for late Covid-19 test results
	Text : Washington, Entity : GPE
	Text : Southern, Entity : NORP
Headline : South Dakota governor says she tested negative for Covid-19 after Fourth of July event
	Text : South Dakota, Entity : GPE
	Text : Fourth of July, Entity : DATE
Headline : Texas Republicans have no plans to cancel in-person convention in Houston 
	Text : Texas, Entity : GPE
	Text : Republicans, Entity : NORP
	Text : Houston, Entity : GPE
Headline : Columbia University will welcome back 60% of undergraduate students in the fall
	Text : Columbia University, Entity : ORG
	Text : 60%, Entity : PERCENT
Headline : Mo
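
The loop above keeps only text that splits into more than four pieces on a single space. A small sketch of that filter as a standalone function follows; `is_headline` and `min_words` are hypothetical names. It uses `split()` with no argument, which ignores repeated, leading, and trailing whitespace, so it can count differently from `split(" ")`.

```python
def is_headline(text, min_words=5):
    """Return True when text has at least min_words whitespace-separated words.

    Hypothetical helper: split() with no argument collapses runs of
    whitespace, unlike split(" "), which yields empty strings for them.
    """
    return len(text.split()) >= min_words


candidates = [
    "What you need to know",
    "Live Updates",
    "  Colombia extends coronavirus lockdown measures  ",
    "a  b  c",  # three words, but split(" ") returns five pieces
]
for text in candidates:
    print(repr(text), is_headline(text))
```

A word count is only a heuristic: a section label such as "What you need to know" still passes, as the output above shows.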

In [16]:
#crawling each page, printing its headlines and storing one record per headline
news_dict=[] #a list of dicts, one per headline
for url, parser, tag, site in zip(urls, formats, tags, website):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, parser)

    for link in soup.find_all(tag):

        if len(link.text.split(" ")) > 4:
            print("Headline : {}".format(link.text))

            #named entities as (text, label) pairs
            entities = [(ent.text,ent.label_) for ent in nlp(link.text).ents]

            news_dict.append({'website': site,'url': url, 'headline':link.text, 'entities':entities})

Headline : What you need to know
Headline : Study finds coronavirus associated with neurological complications
Headline : Colombia extends coronavirus lockdown measures
Headline : Washington state governor blames Southern states reopening early for late Covid-19 test results
Headline : South Dakota governor says she tested negative for Covid-19 after Fourth of July event
Headline : Texas Republicans have no plans to cancel in-person convention in Houston 
Headline : Columbia University will welcome back 60% of undergraduate students in the fall
Headline : More than 45,000 new coronavirus cases reported in Brazil
Headline : Bars ordered to close again in Shelby County, Tennessee
Headline : Texas Education Agency says parents have option to choose remote learning for their children
Headline : Trump says coronavirus crisis will probably 'get worse before it gets better'
Headline : U.S. says China backed hackers who targeted COVID-19 vaccine research
Headline : Coronavirus a 'Category 5 em

A convenient way to keep the scraped data for further analysis in Python is a pandas dataframe, so we store the records in one:

In [17]:
#collecting the data into a dataframe
news_df=pd.DataFrame(news_dict)

In [18]:
#viewing the dataframe with wide columns so that long headlines are not truncated
pd.set_option('display.max_colwidth',800)

news_df.head(20)

Unnamed: 0,website,url,headline,entities
0,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,What you need to know,[]
1,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Study finds coronavirus associated with neurological complications,[]
2,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Colombia extends coronavirus lockdown measures,"[(Colombia, GPE)]"
3,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Washington state governor blames Southern states reopening early for late Covid-19 test results,"[(Washington, GPE), (Southern, NORP)]"
4,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,South Dakota governor says she tested negative for Covid-19 after Fourth of July event,"[(South Dakota, GPE), (Fourth of July, DATE)]"
5,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Texas Republicans have no plans to cancel in-person convention in Houston,"[(Texas, GPE), (Republicans, NORP), (Houston, GPE)]"
6,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Columbia University will welcome back 60% of undergraduate students in the fall,"[(Columbia University, ORG), (60%, PERCENT)]"
7,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,"More than 45,000 new coronavirus cases reported in Brazil","[(More than 45,000, CARDINAL), (Brazil, GPE)]"
8,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,"Bars ordered to close again in Shelby County, Tennessee","[(Shelby County, GPE), (Tennessee, GPE)]"
9,CNN,https://www.cnn.com/world/live-news/coronavirus-pandemic-07-07-20-intl/index.html,Texas Education Agency says parents have option to choose remote learning for their children,"[(Texas Education Agency, ORG)]"


This dataset can be used to complement any existing Covid-19 dataset, adding the latest updates from news websites to the analysis.
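
As one small illustration of such a follow-up analysis, the sketch below counts how often each entity label occurs. It assumes records shaped like the entries of `news_dict`; the three sample records are copied from rows of the dataframe shown above, and `sample_records` and `label_counts` are hypothetical names.

```python
from collections import Counter

# Sample records copied from rows 2 to 4 of the dataframe shown above
sample_records = [
    {"headline": "Colombia extends coronavirus lockdown measures",
     "entities": [("Colombia", "GPE")]},
    {"headline": "Washington state governor blames Southern states reopening "
                 "early for late Covid-19 test results",
     "entities": [("Washington", "GPE"), ("Southern", "NORP")]},
    {"headline": "South Dakota governor says she tested negative for Covid-19 "
                 "after Fourth of July event",
     "entities": [("South Dakota", "GPE"), ("Fourth of July", "DATE")]},
]

# Count each entity label across all headlines
label_counts = Counter(
    label for record in sample_records for _, label in record["entities"]
)
print(label_counts.most_common())
```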