## Raw News Data

The purpose of this notebook is to provide some basic pointers on working with the news data. In its raw form, the news data is stored as `.json` files, grouped by publication.

Each line of these `.json` files is one article, stored as a JSON object that loads as a Python dictionary with two fields: `url` and `html`.

If you run into memory issues, a good way to load articles is to use a generator object that reads the `.json` files line by line:

In [1]:
from json import loads

def get_article_gen(path):
    """ Generator that yields one article at a time. """
    with open(path,'r') as file:
        for line in file:
            yield loads(line.strip('\n'))
            

FILEPATH = './nytimes52.json'

# create a generator that yields articles from the file
articles = get_article_gen(FILEPATH)

# get one article and take a look at it
article = next(articles)
print('article datatype: %s' % type(article))
print('fields: %s' % str(list(article)))

print("\n\nContent:\n")
for field in article:
    print('%s:\n%s\n' % (field,article[field][:300]))

article datatype: <class 'dict'>
fields: ['url', 'html']


Content:

url:
https://www.nytimes.com/2003/02/02/books/in-the-beginning-there-were-paper-towels.html

html:
<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-books format-long tone-review app-article page-theme-standard has-top-ad type-size-small" itemprop="review"itemid="https://www.nytimes.com/2003/02/02/books/in-the-beginning-there-were-paper-towels.html" itemtype="ht



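Because the generator reads lazily, you can also peek at the first few articles of a large file without loading the rest. The sketch below repeats the generator from above so that it runs on its own; `peek_urls` and its `n` parameter are hypothetical names introduced here for illustration, and blank lines in the file are skipped as a precaution.

```python
from itertools import islice
from json import loads

def get_article_gen(path):
    """Generator that yields one article at a time."""
    with open(path, 'r') as file:
        for line in file:
            if line.strip():  # skip blank lines, if any
                yield loads(line)

def peek_urls(path, n=5):
    """Return the URLs of the first n articles without reading the whole file."""
    # islice stops pulling from the generator after n items
    return [article['url'] for article in islice(get_article_gen(path), n)]
```

Calling `peek_urls(FILEPATH, 3)` would then return at most three URLs, however large the file is.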
### Dealing with HTML files

As you can see above, the string stored in the `html` field is the raw HTML that was downloaded. This has the advantage that you can use tools such as the awesome `BeautifulSoup` library to automatically extract specific objects from the file, such as the title of the article in the example below:

In [2]:
# install by running `pip install beautifulsoup4`

from bs4 import BeautifulSoup

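# pass the parser explicitly; otherwise BeautifulSoup picks one for you and emits a warning
soup = BeautifulSoup(article['html'], 'html.parser')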

print(soup.title)

<title>In the Beginning, There Were Paper Towels - The New York Times</title>


Unfortunately, this kind of extraction does not work reliably for all publications, and different publications will require some experimentation and manual tuning. For example, using plain-vanilla `BeautifulSoup` to extract the text of this article doesn't do a great job of removing the markup: the output below is mostly embedded JavaScript. You might be fine, but depending on what you are building, leaving this kind of junk in the files can confuse your NLP models.

In [3]:
print(soup.text[500:1000])

"&variant="+encodeURIComponent(o||0)+"&url="+encodeURIComponent(t.location.href)+"&instant=1&skipAugment=true\n",a&&f.push(a),c||(c=t.setTimeout(r,0))}function r(){var n=new t.XMLHttpRequest,e=f;n.withCredentials=!0,n.open("POST",u),n.onreadystatechange=function(){var t,o;if(4==n.readyState)for(t=200==n.status?null:new Error(n.statusText);o=e.shift();)o(t)},n.send(s),s="",f=[],c=null}function a(t){for(var n,e,o,r,a,i,u,c=0,s=0,f=[],l=[n=1732584193,e=4023233417,~n,~e,3285377520],h=[],p=t.length;s


Looking for specific tags and CSS classes often works better:

In [4]:
print(
    [item.text for item in soup.findAll('p','story-body-text story-content')][:5]
)

['A BOX OF MATCHES', 'By Nicholson Baker.', '178 pp. New York:', 'Random House. $19.95.', "ON second thought, there may be no such thing as a truly dispassionate observer. To behold the world and the human mind up close is also, somehow, to mourn for them a little. Seen keenly enough, every object, no matter how trivial, is a piercing memento mori. Take paper towels. As Nicholson Baker points out in his new novel -- a marvel of ship-in-a-bottle miniaturism that no one else could have written, or would have thought to write -- the decorative patterns on paper towels change through the years in response to tastes and fashions, articulating the larger cultural flux as effectively as art museum biennials. For most people such changes aren't worth noticing, let alone contemplating in detail, and yet, when combined with the countless other small narratives that play themselves out in the background of awareness -- the silent disintegration of an old sock, the unconscious refinement of a new 

You will also find that using `BeautifulSoup` to parse large volumes of text is relatively slow. Depending on what you're doing, I wouldn't be shy about trying regular expressions instead.
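As a rough idea of what the regular-expression route can look like, here is a minimal sketch that uses only the standard library. The patterns and the function names `regex_title` and `regex_text` are assumptions made for illustration; they are not tuned to any particular publication and will miss edge cases such as HTML comments or malformed tags.

```python
import re
from html import unescape

# Illustrative patterns only; real pages may need publication-specific tuning.
SCRIPT_STYLE_RE = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')

def regex_title(html):
    """Return the contents of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return unescape(match.group(1)).strip() if match else None

def regex_text(html):
    """Drop <script>/<style> blocks, strip the remaining tags, and collapse whitespace."""
    html = SCRIPT_STYLE_RE.sub(' ', html)   # remove embedded JavaScript and CSS first
    text = TAG_RE.sub(' ', html)            # then remove all remaining tags
    return ' '.join(unescape(text).split())
```

Removing the `<script>` and `<style>` blocks before stripping tags is what keeps the JavaScript junk out of the result. It is worth comparing the output against `BeautifulSoup` on a handful of articles before trusting it on a whole file.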

### Snippets for Extracting Text

Here are some snippets that extract text for some publications in particular -- you can see if they work for you!

In [5]:
html = article['html']

# nytimes
soup = BeautifulSoup(html,'html.parser')
paragraphs = [tag.get_text() for tag in soup.findAll('title')+soup.findAll('p','story-body-text story-content')]

# breitbart
soup = BeautifulSoup(html,'html.parser')
paragraphs = [
        tag.get_text() for tag in soup.findAll('title')]+[
                item['content'] for item in soup.findAll('meta',property=True) if item['property'] == 'og:description']+[
                        tag.get_text() for tag in soup.findAll('p')]

# buzzfeed
soup = BeautifulSoup(html,'html.parser')
paragraphs = [tag.get_text() for tag in soup.findAll('title')+soup.findAll('p')]

# fox news
soup = BeautifulSoup(html,'html.parser')
paragraphs = [
        tag.get_text() for tag in soup.findAll('title')]+[
                item['content'] for item in soup.findAll('meta',property=True) if item['property'] == 'og:description']+[
                        tag.get_text() for tag in soup.findAll('p')]

# huffpo
soup = BeautifulSoup(html,'html.parser')
paragraphs = [tag.get_text() for tag in soup.findAll('title')+soup.findAll('p')]

# nydailynews
soup = BeautifulSoup(html,'html.parser')
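if soup.findAll("article"):
    paragraphs = [tag.get_text().strip('\r\n\t') for tag in 
                  soup.findAll("title")+
                  soup.findAll("article")[0].findAll('p')+
                  soup.findAll("span",itemprop="caption")+
                  soup.findAll("p","g-article-html")]
else:
    # no <article> element found: start from an empty list instead of
    # silently keeping whatever `paragraphs` held before
    paragraphs = []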
    
# nypost
soup = BeautifulSoup(html,'html.parser')
paragraphs = [tag.get_text() for tag in [soup.findAll('title')[0]]+soup.findAll('description')[1:]+soup.findAll(type='html')+soup.findAll('p')]

# wapo
soup = BeautifulSoup(html,'html.parser')
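# <meta> tags have no text; the description lives in their `content` attribute
paragraphs = [tag.get_text() for tag in soup.findAll('title')]+[
              item.get('content','') for item in soup.findAll('meta',property="og:description")]+[
              tag.get_text() for tag in soup.findAll('p')]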

# wsj
soup = BeautifulSoup(html,'html.parser')
paragraphs = [tag.get_text() for tag in 
    soup.findAll('title')+soup.findAll("h1",'wsj-article-headline')+soup.findAll("h2",'sub-head')+soup.findAll("p")]
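The raw files are already grouped by publication, but if articles from several files end up mixed together, the `url` field can tell you which of the snippets above applies. The sketch below is one possible way to do that with the standard library. The domain list and the name `get_publication` are assumptions introduced here, so check them against the URLs that actually occur in your files.

```python
from urllib.parse import urlparse

# Assumed mapping from domain to the snippet labels used above;
# verify it against the URLs in your own data.
PUBLICATION_BY_DOMAIN = {
    'nytimes.com': 'nytimes',
    'breitbart.com': 'breitbart',
    'buzzfeed.com': 'buzzfeed',
    'foxnews.com': 'fox news',
    'huffingtonpost.com': 'huffpo',
    'huffpost.com': 'huffpo',
    'nydailynews.com': 'nydailynews',
    'nypost.com': 'nypost',
    'washingtonpost.com': 'wapo',
    'wsj.com': 'wsj',
}

def get_publication(url):
    """Return the snippet label for an article URL, or None if the domain is unknown."""
    host = urlparse(url).hostname or ''
    for domain, name in PUBLICATION_BY_DOMAIN.items():
        # match the bare domain and any subdomain such as www.
        if host == domain or host.endswith('.' + domain):
            return name
    return None
```

For the article loaded at the top of this notebook, `get_publication(article['url'])` gives `'nytimes'`.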