# Data Collection

Let's scrape data from the Good News Network website for, well, good news, and from HuffPost for both good and bad news. This should give us a reasonably sized dataset of labeled examples!

In [1]:
!pip install -q bs4 requests grequests lxml

In [2]:
import pandas as pd

# goodnewsnetwork, huffpost, and newser are helper modules I've created to make scraping the sites much easier
import goodnewsnetwork
import huffpost
import newser



Setting `max_results` to 10,000 for both the good and bad news categories on HuffPost should catch every article. It seems unlikely that HuffPost has written and categorized that many articles in either category, especially considering that the site only launched in 2005.

In [3]:
huff_good = huffpost.scrape(category = 'good-news', max_results = 10000)
huff_bad = huffpost.scrape(category = 'bad-news', max_results = 10000)

# scrape the first 250 pages of goodnewsnetwork.org (the first 5000 articles)
gnn_articles = goodnewsnetwork.scrape(stop = 250)

newser_articles = newser.scrape(num_requests = 10, articles_per_request = 500)

articles = huff_good + huff_bad + gnn_articles + newser_articles

articles_dataframe = pd.DataFrame(articles)

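# index=False keeps the row index out of the file, so reloading doesn't add an 'Unnamed: 0' column
articles_dataframe.to_csv('../data/articles.csv', index=False)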

Scraped 250 pages from Good News Network, now adding descriptions.

Expected articles_per_request to be a multiple of 3, since articles are returned by newser.com in groups of three. Fetching 501 articles instead.
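
The warning above suggests that the `newser` helper rounds the requested batch size up to the next multiple of three. The helper's source isn't shown here, so the following is only a guess at that logic, and `round_up_to_multiple` is a hypothetical name:

```python
def round_up_to_multiple(n: int, multiple: int = 3) -> int:
    """Round n up to the nearest multiple of `multiple` (hypothetical helper)."""
    # Ceiling division without floats: -(-n // multiple) == ceil(n / multiple)
    return -(-n // multiple) * multiple


# 500 is not divisible by 3, so the request size becomes 501,
# matching the number reported in the warning.
adjusted = round_up_to_multiple(500)
```

Passing `articles_per_request = 501` directly would presumably avoid the warning altogether.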



Next, let's scrape some articles from HuffPost's [Impact](https://www.huffpost.com/impact/) category, which is "dedicated to causes, actionable news, inspiring stories and solutions of all scales," according to their LinkedIn page. It's safe to say that we can label these as good news.

In [31]:
articles_dataframe = pd.read_csv('../data/articles.csv')

huff_impact = huffpost.scrape(category = 'impact', max_results = 10000)

for article in huff_impact:
    # The huffpost scraper sets each article's sentiment category to the value
    # of the *news* category passed to it, so relabel 'impact' as 'good' for
    # consistency with the rest of the dataset.
    article['category'] = 'good'

In [32]:
impact_df = pd.DataFrame(huff_impact)

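# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
articles_dataframe = pd.concat([articles_dataframe, impact_df], ignore_index=True)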

To round out the dataset, let's grab some neutral articles. After looking through HuffPost, the Style & Beauty and Home & Living categories seem fairly neutral, with mostly informational content. Without a human labeling the data, it's tough to identify neutral articles, so this may be the best we can do.

In [33]:
style_and_beauty = huffpost.scrape(category = 'style-beauty', max_results = 10000)
home_and_living = huffpost.scrape(category = 'home-living', max_results = 10000)

neutral_articles = style_and_beauty + home_and_living
for article in neutral_articles:
    article['category'] = 'neutral'
    
neutral_dataframe = pd.DataFrame(neutral_articles)

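# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
articles_dataframe = pd.concat([articles_dataframe, neutral_dataframe], ignore_index=True)

articles_dataframe.to_csv('../data/articles.csv', index=False)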

In [44]:
len(neutral_dataframe)

285

Looks like we could definitely use some more neutral articles. Also, we clearly mislabeled some of the bad articles as good, but we can handle that later in the data cleaning notebook.

In [39]:
articles_dataframe = pd.read_csv('../data/articles.csv')

weird_news = huffpost.scrape(category = 'weird-news', max_results = 10000)

food_drink = huffpost.scrape(category = 'food-drink', max_results = 10000)

neutrals = weird_news + food_drink

for article in neutrals:
    article['category'] = 'neutral'
    
neutral_df = pd.DataFrame(neutrals)

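# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
articles_dataframe = pd.concat([articles_dataframe, neutral_df], ignore_index=True)

articles_dataframe.to_csv('../data/articles.csv', index=False)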

In [40]:
len(neutral_df)

239

Okay, so that got us a *few* more examples.
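
The scrape-then-relabel pattern has now appeared three times, so it could be factored out. Here is a minimal sketch, assuming (as the loops above suggest) that each scraper returns a list of dicts with a `'category'` key. `relabel` is a hypothetical helper, not part of the `huffpost` module, and the sample articles are made up for illustration:

```python
from typing import Dict, List


def relabel(articles: List[Dict[str, str]], label: str) -> List[Dict[str, str]]:
    """Return copies of the scraped articles with 'category' set to `label`."""
    # Copy each dict so the scraper's original output is left untouched.
    return [{**article, 'category': label} for article in articles]


# Made-up stand-ins for what a huffpost.scrape(...) call might return.
sample = [
    {'title': 'Example headline A', 'category': 'weird-news'},
    {'title': 'Example headline B', 'category': 'food-drink'},
]
neutrals = relabel(sample, 'neutral')
```

The returned list can be passed straight to `pd.DataFrame(...)`, as in the cells above.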

In [41]:
articles = pd.read_csv('../data/articles.csv')

In [43]:
articles['category'].value_counts()

good       10901
neutral      523
bad          453
Name: category, dtype: int64

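With roughly 10,900 good articles against about 500 each of neutral and bad, the dataset is heavily imbalanced, and we may have to live with that. We can try some augmentation later if really necessary, but otherwise let's just move on.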
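
Before reaching for augmentation, one cheaper option is to weight classes inversely to their frequency at training time. This is a sketch of that idea using the counts printed above, not something this notebook actually does. The formula `n_samples / (n_classes * count)` is the same heuristic scikit-learn uses for `class_weight='balanced'`, and `balanced_class_weights` is a hypothetical helper:

```python
from typing import Dict


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Counts taken from the value_counts() output above.
category_counts = {'good': 10901, 'neutral': 523, 'bad': 453}
weights = balanced_class_weights(category_counts)
# Rare classes ('bad', 'neutral') get weights well above 1; 'good' gets one below 1.
```

Whether weighting is enough, or whether the good class should be downsampled instead, is something to revisit once a model is actually being trained.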