In this notebook, we prepare the news data downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/News+Aggregator) (News Aggregator dataset).

_Credit: parts of the code used to prepare the news data are taken from this Towards Data Science [post](https://towardsdatascience.com/conditional-text-generation-by-fine-tuning-gpt-2-11c1a9fc639d)._

# Main Steps
1. Load the dataset
2. Keep only the data that meets the following criteria:
    - Business news (``CATEGORY``=='b')
    - Random selection of 10k articles
3. Fetch the news content from the link in the dataset
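
The filtering and sampling criteria can be tried out on a small toy frame first. The sketch below is only an illustration: the rows are made up, only two of the dataset's columns are kept, and the sample size is reduced from 10k to 3.

```python
import pandas as pd

# Toy stand-in for the news table (hypothetical rows, two columns only).
toy = pd.DataFrame({
    "TITLE": [f"headline {i}" for i in range(8)],
    "CATEGORY": ["b", "t", "b", "e", "b", "m", "b", "b"],
})

# Keep only business news.
business = toy[toy["CATEGORY"] == "b"]

# Draw a fixed-size random sample; random_state makes the draw repeatable.
sample_size = 3  # the notebook uses 10**4
sample = business.sample(n=sample_size, random_state=1234)

print(business.shape, sample.shape)
```

Fixing `random_state` means that rerunning the cell selects the same rows, which matters here because the fetched article text is attached to those rows later.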

In [1]:
import pandas as pd
from newspaper import Article

random_state = 1234

columns = ['ID',
           'TITLE',
           'URL',
           'PUBLISHER',
           'CATEGORY',  # news category (b = business, t = science and technology, e = entertainment, m = health)
           'Alphanumeric ID',
           'HOSTNAME Url',
           'TIMESTAMP']

# the file is tab-separated and has no header row, so column names are passed explicitly
newsData = pd.read_csv("./raw_data/newsCorpora.csv",
                       delimiter='\t', names=columns)
print("Original news df shape: ", newsData.shape)

# keep only business news
newsData = newsData[newsData.CATEGORY == 'b']
print("Business news df shape: ", newsData.shape)
# print("Science and Technology news df shape", newsData.shape)

# to limit compute cost, keep only a random sample of 10k articles
newsData = newsData.sample(10**4, random_state = random_state)
print("Randomly select 10k Business news df shape: ", newsData.shape)


newsData.head(3)

Original news df shape:  (422419, 8)
Business news df shape:  (115967, 8)
Randomly select 10k Business news df shape:  (10000, 8)


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP
44056,44057,Missing Malayasia Airlines MH370: China sends ...,http://www.india.com/loudspeaker/missing-malay...,India.com,b,dpJ38zm8NY8hUFM22x2aL0hc9hjfM,www.india.com,1395824966103
316897,317357,5 Things to Know About the Changing State of O...,http://blogs.wsj.com/briefly/2014/06/25/5-thin...,Wall Street Journal (blog),b,doclsNbkIioH-9Maj9kRRSWSdRTuM,blogs.wsj.com,1403775235740
367142,367602,ECB's Draghi's Plan May Miss the Target,http://www.marketpulse.com/20140706/ecbs-dragh...,MarketPulse (blog),b,dybB1-1U7xt8vLMLKhbW1GpBlkggM,www.marketpulse.com,1404822872146
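
The raw file has no header row and uses tabs as separators, which is why the column names are supplied by hand. A minimal sketch of that loading pattern, using an in-memory string with made-up rows and a shortened, hypothetical column list instead of the real file:

```python
import io
import pandas as pd

# Two made-up, tab-separated rows without a header line.
raw = "1\tFirst headline\tb\n2\tSecond headline\tt\n"
columns = ["ID", "TITLE", "CATEGORY"]

# names= assigns the column labels; sep="\t" splits on tabs.
df = pd.read_csv(io.StringIO(raw), sep="\t", names=columns)

print(df.shape)
print(df.columns.tolist())
```

Without `names=`, pandas would treat the first data row as the header, so one article would be lost and the columns would be mislabeled.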


In [2]:
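from tqdm.contrib import tzip

# flag column: set to 1 once the article text has been fetched
newsData['got_text'] = 0

# split into three chunks; .copy() makes each chunk an independent DataFrame,
# so later .loc assignments do not trigger SettingWithCopyWarning
newsData_1 = newsData.iloc[:3333].copy()
newsData_2 = newsData.iloc[3333:6666].copy()
newsData_3 = newsData.iloc[6666:].copy()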
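
The chunk boundaries above are hard-coded for a 10k-row frame. As a sketch of one alternative, the hypothetical helper `split_frame` below derives the boundaries from the frame length and returns independent copies. It is not part of the original notebook, and it places any leftover rows in the first chunks rather than the last one.

```python
import pandas as pd

def split_frame(df, n_chunks):
    """Split df into n_chunks contiguous, independent copies."""
    size, extra = divmod(len(df), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional row each
        stop = start + size + (1 if i < extra else 0)
        chunks.append(df.iloc[start:stop].copy())
        start = stop
    return chunks

toy = pd.DataFrame({"URL": [f"http://example.com/{i}" for i in range(10)]})
parts = split_frame(toy, 3)
print([len(p) for p in parts])  # [4, 3, 3]
```

Because each chunk is a copy, writing new columns into it does not touch the original frame and does not raise pandas' copy-of-a-slice warning.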

In [3]:
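for row, url in tzip(newsData_1.index, newsData_1.URL):
    # upgrade http:// to https://; URLs that are already https stay unchanged
    url_new = ('https://' + url[7:]) if url.startswith('http://') else url
    try:
        article = Article(url_new)
        article.download()
        article.parse()
        newsData_1.loc[row, 'text'] = article.text
        if article.text:
            newsData_1.loc[row, 'got_text'] = 1
    except Exception:
        # skip articles that fail to download or parse
        pass
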
newsData_1

  0%|          | 0/3333 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP,got_text,text
44056,44057,Missing Malayasia Airlines MH370: China sends ...,http://www.india.com/loudspeaker/missing-malay...,India.com,b,dpJ38zm8NY8hUFM22x2aL0hc9hjfM,www.india.com,1395824966103,1,Also Read - Australia: Memorial Planned Honour...
316897,317357,5 Things to Know About the Changing State of O...,http://blogs.wsj.com/briefly/2014/06/25/5-thin...,Wall Street Journal (blog),b,doclsNbkIioH-9Maj9kRRSWSdRTuM,blogs.wsj.com,1403775235740,0,
367142,367602,ECB's Draghi's Plan May Miss the Target,http://www.marketpulse.com/20140706/ecbs-dragh...,MarketPulse (blog),b,dybB1-1U7xt8vLMLKhbW1GpBlkggM,www.marketpulse.com,1404822872146,1,Mario Draghi’s plan to end the euro area’s len...
24979,24980,Schneiderman: Fast stock trade advantages unfair,http://poststar.com/news/state-and-regional/sc...,Glens Falls Post-Star,b,d5fZSL3c7uQ95bMK38nkSLgMo0cAM,poststar.com,1395320415890,0,
47630,47631,Candy Crush Saga shares dive 10% in Wall Stree...,http://www.thisismoney.co.uk/money/markets/art...,This is Money,b,dQtK6MP92Wshd3MbmOLp71BNJ1smM,www.thisismoney.co.uk,1395880982244,1,Candy Crush Saga shares dive 10% in Wall Stree...
...,...,...,...,...,...,...,...,...,...,...
348807,349267,HK currency intervention is a positive sign,http://www.ig.com/uk/market-update/2014/07/03/...,IG,b,dkpRSIMBCMKcw0MGAdthwBjOXNMEM,www.ig.com,1404381649556,1,Spread bets and CFDs are complex instruments a...
356825,357285,Uber's London operations remain in legal limbo...,http://venturebeat.com/2014/07/03/ubers-london...,VentureBeat,b,deamRsxqjRde37Mkd7klW11hc6dcM,venturebeat.com,1404522180234,0,
211834,212280,Call for end to anti-China protests,http://www.belfasttelegraph.co.uk/news/world-n...,Belfast Telegraph,b,dqs7XmgHyoVndfMqMxxv0Zk_zZgoM,www.belfasttelegraph.co.uk,1400373219633,1,Mobs burned and looted scores of foreign-owned...
376762,377222,"July 15, 2014, 4:51 am",https://au.news.yahoo.com/thewest/business/wor...,The West Australian,b,dDbZs5yNvMXdzyM2eLkppVUBnmu6M,au.news.yahoo.com,1405422893222,0,
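
The loop rewrites each URL to use `https` before downloading. A small sketch of that rewrite as a standalone helper (the name `force_https` and the example.com URLs are hypothetical) shows why the prefix should be checked rather than sliced off blindly: dropping the first four characters of a URL that already starts with `https` produces an invalid `httpss://` scheme.

```python
def force_https(url):
    """Upgrade an http:// URL to https://; return any other URL unchanged."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url

print(force_https("http://example.com/a"))
print(force_https("https://example.com/a"))

# For comparison: prepending 'https' to url[4:] on a URL that is already https
print("https" + "https://example.com/a"[4:])
```

The last line prints `httpss://example.com/a`, which cannot be downloaded, so such rows would silently end up with `got_text` equal to 0.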


In [5]:
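for row, url in tzip(newsData_2.index, newsData_2.URL):
    # upgrade http:// to https://; URLs that are already https stay unchanged
    url_new = ('https://' + url[7:]) if url.startswith('http://') else url
    try:
        article = Article(url_new)
        article.download()
        article.parse()
        newsData_2.loc[row, 'text'] = article.text
        if article.text:
            newsData_2.loc[row, 'got_text'] = 1
    except Exception:
        # skip articles that fail to download or parse
        pass
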
newsData_2

  0%|          | 0/3333 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP,got_text,text
29430,29431,Gul at odds with Erdogan over Twitter ban,http://gulfnews.com/news/world/other-world/tur...,gulfnews.com,b,drWjpn6_UzSkmiM_9qWYJe3-LIGPM,gulfnews.com,1395504371448,1,Istanbul: Turkish President Abdullah Gul set h...
314477,314937,Hazard Insurers May Face Suit Over Fees to Fan...,http://www.bloomberg.com/news/2014-06-25/hazar...,Bloomberg,b,dYl0oLFTVhcqb7M89YQbLlPHc5HuM,www.bloomberg.com,1403701980981,1,Why did this happen?\n\nPlease make sure your ...
204381,204817,Nissan profit rises as global sales pick up,http://www.newsday.com/business/nissan-profit-...,Newsday,b,dz8V3qjYFB3qJkM-K5H53BVUMOfXM,www.newsday.com,1399911319557,0,
171067,171403,UPDATE 2-EBay beefs up US war chest in pursuit...,http://www.reuters.com/article/2014/04/29/ebay...,Reuters,b,d4LdeQeG96e-2JMUeby0G-P-7w3hM,www.reuters.com,1398849377518,0,
24379,24380,"Boeing 787's design, manufacture safe, FAA says",http://www.tulsaworld.com/business/aerospace/b...,Tulsa World,b,dhpfyfVbJn2C9wMqhHh7BvZNhtmvM,www.tulsaworld.com,1395317048158,1,WASHINGTON — Boeing's design and manufacture ...
...,...,...,...,...,...,...,...,...,...,...
196412,196848,"COLUMN-Was Barclays the problem, or was it the...",http://www.reuters.com/article/2014/05/08/colu...,Reuters,b,dmfYAYOSCXeIgdMeNy_RGzRlSo-4M,www.reuters.com,1399625744680,0,
158868,159204,Amazon's Classic Take On The Classics,http://blogs.wsj.com/moneybeat/2014/04/24/amaz...,Wall Street Journal (blog),b,dmQqhPNKBmqhwnMNnj-ERzfMdzl6M,blogs.wsj.com,1398393873190,0,
127632,127968,General Mills' New Privacy Policy Restricts Co...,http://publicradioeast.org/post/general-mills-...,Public Radio East,b,d-0bCW2yA_8Sq2MzCOcNrOt3VU-bM,publicradioeast.org,1397760846283,0,
296772,297232,SolarCity gets into photovoltaic manufacturing...,http://www.bizjournals.com/sanjose/news/2014/0...,Silicon Valley Business Journal,b,dZmv_NETbjeVQ4MocN2FOaVtOyFLM,www.bizjournals.com,1403066474971,0,
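
The same fetch loop is repeated for each chunk. One way it could be wrapped into a single function is sketched below. `fetch_texts` and `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument so that the example runs without network access; in the notebook, the fetcher would presumably wrap the `Article` download and parse calls.

```python
import pandas as pd

def fetch_texts(df, fetch):
    """Return a copy of df with 'text' and 'got_text' filled from fetch(url)."""
    out = df.copy()
    out["text"] = ""
    out["got_text"] = 0
    for row, url in zip(out.index, out["URL"]):
        try:
            text = fetch(url)
        except Exception:
            continue  # failed fetch: leave text empty and got_text at 0
        if text:
            out.loc[row, "text"] = text
            out.loc[row, "got_text"] = 1
    return out

def fake_fetch(url):
    # Hypothetical stand-in for a real article downloader.
    if "broken" in url:
        raise RuntimeError("download failed")
    return "" if "empty" in url else f"body of {url}"

toy = pd.DataFrame(
    {"URL": ["https://example.com/a",
             "https://example.com/broken",
             "https://example.com/empty"]},
    index=[10, 20, 30],
)
result = fetch_texts(toy, fake_fetch)
print(result[["got_text", "text"]])
```

Working on a copy and returning it keeps the input frame unchanged, and both a raised exception and an empty article body leave `got_text` at 0, matching the behaviour of the loops above.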


In [6]:
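for row, url in tzip(newsData_3.index, newsData_3.URL):
    # upgrade http:// to https://; URLs that are already https stay unchanged
    url_new = ('https://' + url[7:]) if url.startswith('http://') else url
    try:
        article = Article(url_new)
        article.download()
        article.parse()
        newsData_3.loc[row, 'text'] = article.text
        if article.text:
            newsData_3.loc[row, 'got_text'] = 1
    except Exception:
        # skip articles that fail to download or parse
        pass
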
newsData_3

  0%|          | 0/3334 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP,got_text,text
316781,317241,GDP: Final Estimate For First Quarter Is Much ...,http://www.forbes.com/sites/mikepatton/2014/06...,Forbes,b,dDkTxCFNrRmzqyMPP9l7fynC7yuiM,www.forbes.com,1403773543378,0,
184823,185159,OECD: Clouds hang over emerging markets,http://www.iol.co.za/business/international/oe...,Independent Online,b,djYdAphw5jDBdqMzI96Ssguo2rdAM,www.iol.co.za,1399437419419,0,
307351,307811,HERE COMES FLASH US PMI...,http://www.businessinsider.in/HERE-COMES-FLASH...,Businessinsider India,b,d7ciPlHSS0FPGHM1D-xxX_18nfo-M,www.businessinsider.in,1403539558316,1,"At 9:45 a.m. ET, Markit will publish the preli..."
376990,377450,Whiting's $6B Deal for Kodiak Would Form Large...,http://www.naturalgasintel.com/articles/99004-...,Natural Gas Intelligence,b,dap7TgQqIJKCXtMuS6rCqs8qAoRkM,www.naturalgasintel.com,1405425429789,1,In a move that did not appear to surprise anal...
101518,101715,Agostini: Colin Kaepernick's mistake is leavin...,http://www.modbee.com/2014/04/11/3288692/agost...,Modesto Bee,b,dq4CkE5dd_NRkmMCBHHuQlgb-upoM,www.modbee.com,1397290663256,0,
...,...,...,...,...,...,...,...,...,...,...
124304,124640,AG Issues Warning About Wedding Planning Services,http://www.northcountrygazette.org/2014/04/15/...,North Country Gazette,b,d4R99H_bzn5317MgwOcNIUW8-ppSM,www.northcountrygazette.org,1397702580253,0,
28960,28961,IRS Watchdog: Impostor Phone Scam Largest Ever,http://blog.aarp.org/2014/03/21/irs-watchdog-i...,AARP News (blog),b,dR39a35kcuso-9M3bXDszPvdbxY1M,blog.aarp.org,1395503779820,0,
223591,224037,Vodafone earnings tumble after 'mixed' perform...,http://www.scotsman.com/business/media-tech-le...,Scotsman,b,d5U_wamCIMERFaMINjAWXKgW9PJcM,www.scotsman.com,1400581852375,0,
400680,401199,NLRB's New Ruling Could Mean Great Things for ...,http://inthesetimes.com/working/entry/17015/fi...,In These Times,b,dVHK1NwbYZuSB-MZayQWVGdNouSYM,inthesetimes.com,1406778000166,0,


In [22]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three chunks
newsData_new = pd.concat([newsData_1, newsData_2, newsData_3])

newsData_new = newsData_new[newsData_new.got_text==1]
newsData_new

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP,got_text,text
44056,44057,Missing Malayasia Airlines MH370: China sends ...,http://www.india.com/loudspeaker/missing-malay...,India.com,b,dpJ38zm8NY8hUFM22x2aL0hc9hjfM,www.india.com,1395824966103,1,Also Read - Australia: Memorial Planned Honour...
367142,367602,ECB's Draghi's Plan May Miss the Target,http://www.marketpulse.com/20140706/ecbs-dragh...,MarketPulse (blog),b,dybB1-1U7xt8vLMLKhbW1GpBlkggM,www.marketpulse.com,1404822872146,1,Mario Draghi’s plan to end the euro area’s len...
47630,47631,Candy Crush Saga shares dive 10% in Wall Stree...,http://www.thisismoney.co.uk/money/markets/art...,This is Money,b,dQtK6MP92Wshd3MbmOLp71BNJ1smM,www.thisismoney.co.uk,1395880982244,1,Candy Crush Saga shares dive 10% in Wall Stree...
41832,41833,Why excessive executive pay is a mistake,http://www.cbsnews.com/news/why-excessive-exec...,CBS News,b,dCT-PMJKDpxizFM1KG4HNznXSzt9M,www.cbsnews.com,1395772526278,1,"Commentary:\n\nWhen David Winters, a longtime ..."
66042,66043,CBS Outdoor to Start Trading,http://www.nasdaq.com/article/cbs-outdoor-to-s...,NASDAQ,b,dUzQIN3A2UMV7cM7WXmdwd290P7pM,www.nasdaq.com,1396177634518,1,Your symbols have been updated\n\nYou'll now b...
...,...,...,...,...,...,...,...,...,...,...
259019,259465,Presidents of Russia and France to discuss Ukr...,http://www.theguardian.com/world/2014/may/28/v...,The Guardian,b,dCXZzr8rVI5aTGMgAB8jJXCTtcnGM,www.theguardian.com,1401288128176,1,"The Russian president, Vladimir Putin, will di..."
301526,301986,Adobe's 2014 Creative Cloud update: Desktop up...,http://www.zdnet.com/adobes-2014-creative-clou...,ZDNet,b,dx5rs7ln2b_Ad8MMGjFrLWi0Q-R6M,www.zdnet.com,1403115969825,1,"Adobe's Creative Cloud (CC), the subscription-..."
51636,51637,Here's Where To Download Microsoft Office For ...,http://www.businessinsider.in/Heres-Where-To-D...,Businessinsider India,b,dUlKgaqxUngrA5MKtr_8lQs0ElRzM,www.businessinsider.in,1395951444425,1,"Microsoft launched Office for iPad today, maki..."
333659,334119,Housing Update: Pending Home Sales Surge In May,http://www.benzinga.com/news/14/06/4670897/hou...,Benzinga,b,d_gSOVyb6AktUYMCiBrZgMdhpbkTM,www.benzinga.com,1404144616697,1,The National Association of Realtors reports t...
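
After the chunks are recombined, only rows with `got_text == 1` are kept. A short sketch on made-up data shows the same combine-and-filter step and how the share of retained rows could be reported. The frames `part_1` and `part_2` are hypothetical stand-ins for the fetched chunks.

```python
import pandas as pd

part_1 = pd.DataFrame({"got_text": [1, 0], "text": ["alpha", ""]}, index=[10, 11])
part_2 = pd.DataFrame({"got_text": [0, 1], "text": ["", "beta"]}, index=[20, 21])

# Stack the chunks; the original row labels are preserved.
combined = pd.concat([part_1, part_2])

# Keep only rows where the article text was fetched.
kept = combined[combined["got_text"] == 1]

print(f"kept {len(kept)} of {len(combined)} rows ({len(kept) / len(combined):.0%})")
```

Reporting the retained share makes it easy to see how many of the sampled articles were lost to dead links or paywalls.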


In [23]:
newsData_new.to_csv('news_data.csv')
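
By default, `to_csv` also writes the row index as a first column without a name. When such a file is read back without further options, that column appears as `Unnamed: 0`. The sketch below illustrates this with an in-memory buffer and a made-up two-row frame, and shows `index_col=0` as one way to restore the index on reload.

```python
import io
import pandas as pd

df = pd.DataFrame({"ID": [44057, 367602], "got_text": [1, 1]},
                  index=[44056, 367142])

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored

print(naive.columns.tolist())
print(restored.index.tolist())
```

Passing `index=False` to `to_csv` is the alternative when the row labels are not needed downstream.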

In [12]:
newsData_new[newsData_new.ID==322309]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,Alphanumeric ID,HOSTNAME Url,TIMESTAMP,got_text,text
321849,322309,May consumer spending up modestly at 0.2%,http://www.businesstimes.com.sg/premium/world/...,THE BUSINESS TIMES (subscription),b,dcnLpLtxYqnLt9MRM43DAw7DaNy2M,www.businesstimes.com.sg,1403852150072,0,
