Bloomberg video news scraping code.
Aaron Tian

Most recent update: May 13, 3:15 PM
Wrote a function that does the web scraping; the user only needs to supply the search keyword and the number of articles required.

Drawbacks: As discussed on May 12, we use the video tab to search old news from Bloomberg, so each result is a video clip and the final output file has no real "content" (and no "author" either). The other issue is that no "url" field is available. However, the most important text on the webpage is already captured in the "title" and "description" fields. If these two fields are all we need for sentiment analysis, we already have enough information.
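
A minimal sketch of how the "title" and "description" fields could be joined into one text string for sentiment analysis. The helper name `article_text` and the sample article are hypothetical, made up for illustration; only the field names come from the scraper output.

```python
def article_text(article):
    """Join the 'title' and 'description' fields, skipping missing values."""
    parts = [article.get('title'), article.get('description')]
    return ' '.join(p.strip() for p in parts if p)


# Hypothetical sample in the same shape as the scraper's output dicts.
sample = {
    'source': 'Bloomberg',
    'author': None,
    'title': 'Mortgage rates rise',
    'description': 'Lenders lift five-year fixed rates.',
    'content': None,
}

print(article_text(sample))
```

Articles where both fields are missing yield an empty string, so they can be filtered out before scoring.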

To Do:
Amy discovered an Ajax API that could be used to scrape news articles from Bloomberg. I will look into that.

In [24]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import time
import json

### Bloomberg Video News Scraper

In [37]:
def bnn_videonews_scraper(query, total_num):
    '''
    Scrape video news from the BNN Bloomberg website for a search query and a requested number of articles.
    Write the articles to a JSON file and return them as a list of dicts.
    
    input:
    
    query: (str) search keyword
    total_num: (int) number of articles requested by the user; must be a multiple of 10
    '''
    output_list = []
    query = query.split()
    search_query = '%20'.join(query)
        
    if total_num % 10 != 0:
        print("Please enter a number that is a multiple of 10.")
        return
    search_page = total_num // 10
    
    
    news_list = []
    for i in range(1,search_page+1):
        url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search={search_query}'
        response = requests.get(url)
        response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
        json_data = response.json()
        news_list.extend(json_data['Items'])
        time.sleep(1)
    
    for article in news_list:
        article_dict = {}
        
        article_dict['source'] = 'Bloomberg'
        article_dict['author'] = None
        article_dict['title'] = article.get('Name', None)
        article_dict['description'] = article.get('Desc', None)
        article_dict['url'] = None
        images = article.get('Images') or [{}]  # guard against a missing or empty image list
        article_dict['urlToImage'] = images[0].get('Url', None)
        article_dict['PublishedAt'] = article.get('BroadcastDate', None)
        article_dict['content'] = None
        
        output_list.append(article_dict)
        
    if len(output_list) < total_num:
        print(f"Fewer than {total_num} articles match the search query; please try another query.")
        
    with open('_'.join(query) + '_' + str(len(output_list)) + '_' +'Bloomberg_video' + '.json', 'w') as json_file:
        json.dump(output_list, json_file)
        
    
    
    return output_list
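
The function above joins the query words with `%20` by hand and only accepts multiples of 10. One possible variation, shown here as a sketch rather than a tested replacement, is to percent-encode the query with `urllib.parse.quote` and round the page count up. The helper names `pages_needed` and `build_search_url` are hypothetical; the URL template and the page size of 10 are copied from the function above.

```python
from urllib.parse import quote

BASE_URL = 'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents'
INCLUDE = '[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]'
PAGE_SIZE = 10  # same page size as the original request


def pages_needed(total_num):
    """Number of pages to fetch, rounded up (ceiling division)."""
    return -(-total_num // PAGE_SIZE)


def build_search_url(query, page):
    """Build the search URL for one page, percent-encoding the query."""
    # Collapse repeated whitespace first, as query.split() does in the original.
    encoded = quote(' '.join(query.split()))
    return (f'{BASE_URL}?$inlinecount=true&$include={INCLUDE}'
            f'&$page={page}&$top={PAGE_SIZE}&$sort=BroadcastDate&$order=Desc'
            f'&$search={encoded}')


print(pages_needed(25))
print(build_search_url('mortgage rates', 1))
```

With this approach the caller would fetch `pages_needed(total_num)` pages and then keep only the first `total_num` items.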

### Mortgage Rates

In [28]:
# bloomberg_mortgage_rates = []
# for i in range(1,6):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=mortgage%20rates'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_mortgage_rates.extend(json_data['Items'])
#     time.sleep(1)

In [29]:
bloomberg_mortgage_rates = bnn_videonews_scraper('mortgage rates', 50)

In [30]:
assert(len(bloomberg_mortgage_rates) == 50)

### Interest Rates

In [68]:
# bloomberg_interest_rates = []
# for i in range(1,6):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=interest%20rates'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_interest_rates.extend(json_data['Items'])
#     time.sleep(1)

In [31]:
bloomberg_interest_rates = bnn_videonews_scraper('interest rates', 50)

In [32]:
assert(len(bloomberg_interest_rates) == 50)

### Housing Price

In [74]:
# bloomberg_housing_price = []
# for i in range(1,11):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=housing%20price'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_housing_price.extend(json_data['Items'])
#     time.sleep(1)

In [38]:
bloomberg_housing_price = bnn_videonews_scraper('housing price', 50)

The number of articles that related to searching searching query is less than 50, please try another query.


In [41]:
bloomberg_housing = bnn_videonews_scraper('housing', 50)

In [42]:
assert(len(bloomberg_housing) == 50)
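
The manual retry in this section (falling back from 'housing price' to 'housing') could be automated. The sketch below is one way to do it, assuming a scraper with the same `(query, total_num)` signature as `bnn_videonews_scraper`; the name `scrape_with_fallback` and the stand-in scraper are hypothetical and used only so the example runs without network access.

```python
def scrape_with_fallback(queries, total_num, scraper):
    """Try each query in order; return (query, articles) for the first
    query that yields at least total_num articles, else (None, [])."""
    for query in queries:
        articles = scraper(query, total_num)
        if articles and len(articles) >= total_num:
            return query, articles
    return None, []


def fake_scraper(query, total_num):
    """Stand-in for bnn_videonews_scraper: pretends only 'housing' has enough results."""
    available = {'housing price': 30, 'housing': 200}
    count = min(available.get(query, 0), total_num)
    return [{'title': f'{query} #{i}'} for i in range(count)]


used_query, articles = scrape_with_fallback(['housing price', 'housing'], 50, fake_scraper)
print(used_query, len(articles))
```

In the notebook, `fake_scraper` would be replaced by `bnn_videonews_scraper`; note that the real function also writes a JSON file for every query it tries.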

### Employment

In [81]:
# bloomberg_employment = []
# for i in range(1,6):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=employment'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_employment.extend(json_data['Items'])
#     time.sleep(1)

In [48]:
bloomberg_employment = bnn_videonews_scraper('employment', 50)

In [49]:
assert(len(bloomberg_employment) == 50)

### GDP

In [85]:
# bloomberg_gdp = []
# for i in range(1,6):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=GDP'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_gdp.extend(json_data['Items'])
#     time.sleep(1)

In [46]:
bloomberg_gdp = bnn_videonews_scraper('GDP', 50)

In [47]:
assert(len(bloomberg_gdp) == 50)

### Stock Market

In [87]:
# bloomberg_stock_market = []
# for i in range(1,6):
#     url = f'https://capi.9c9media.com/destinations/bnn_web/platforms/desktop/contents?$inlinecount=true&$include=[Images,Desc,ShortDesc,BroadcastDate,Type,BroadcastTime,ContentPackages,Media,Keywords,Genres,Tags]&$page={i}&$top=10&$sort=BroadcastDate&$order=Desc&$search=stock%20market'
#     response = requests.get(url)
#     json_data = response.json()
#     bloomberg_stock_market.extend(json_data['Items'])
#     time.sleep(1)

In [50]:
bloomberg_stock_market = bnn_videonews_scraper('stock market', 50)

In [51]:
assert(len(bloomberg_stock_market) == 50)