# Scraping News articles using NewsAPI

In this notebook, we will use NewsAPI (www.newsapi.org) to pull news articles from popular news websites. See https://newsapi.org/sources for all the possible sources of news data.

The article text can be used for many purposes, including natural language processing. One interesting application could be to use the text from the news headlines to predict movement of specific stocks in the market.

This notebook will show you how to set up a NewsAPI account and pull data. We will also quickly discuss the type of data that is returned, because it does not come in a tidy dataframe like you might expect. Instead, it arrives as nested lists and dictionaries, so it is important to understand how those work to follow this process.

Here are the steps to this process:

1. Set up a NewsAPI account and get an API Token
2. Request the raw data from the NewsAPI
3. Convert that raw data into something usable for analysis

## 1. Set up a NewsAPI account and get an API Token

Go to www.newsapi.org. On the top right of the page, click "Get API Key". You'll need to enter your email and a password. They will send you an authorization key via email. Copy that long string to the variable 'api_key' below.

## 2. Request the raw data from the NewsAPI

First, I want to give you a brief idea of how an API works. API stands for Application Programming Interface. Basically, an API gives programs a documented way to request data from a specific site or database, so you don't have to do the technical, nitty-gritty work of scraping and parsing web pages yourself. For example, Twitter and Facebook both have APIs that let you pull data about people's behavior on social media.

For a more detailed description, click here: http://www.programmableweb.com/api-university/what-are-apis-and-how-do-they-work

In [1]:
import requests  # HTTP library used to call NewsAPI; its .json() method handles the JSON parsing

In [2]:
# Make sure to put in the API key you received when creating an account
api_key = "bc09befedac64b07a7291e127340290d"
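Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched here, is to read the key from an environment variable. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are assumptions made for illustration; NewsAPI does not require either.

```python
import os


def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice; use whatever name you set.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your NewsAPI key")
    return key
```

With the variable set in your shell, `api_key = load_api_key()` would take the place of the hard-coded string.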

In this next cell, make sure to change the URL depending on which source you want your data to come from. In this case, I requested news articles from TechCrunch, so you will see that I put the string 'techcrunch' as my source in the URL. Check out the documentation to see more about how the request should be formatted as well as other potential sources of news.

In [3]:
# Creating the URL to request the news from
url = "https://newsapi.org/v1/articles?source=techcrunch&apiKey=" + api_key
request = requests.get(url) # make a request to the site
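To switch sources without editing the string by hand, the URL can be assembled from its parts. This is a minimal sketch using only the standard library; the helper name `build_articles_url` is hypothetical, and the endpoint and parameter names are copied from the URL in this notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this notebook
BASE_URL = "https://newsapi.org/v1/articles"


def build_articles_url(source, api_key):
    """Build the request URL for one news source (hypothetical helper)."""
    query = urlencode({"source": source, "apiKey": api_key})
    return f"{BASE_URL}?{query}"


# Placeholder key for illustration only
example_url = build_articles_url("techcrunch", "YOUR_API_KEY")
print(example_url)
```

The resulting string can be passed to `requests.get` exactly like the hand-written URL.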

We are going to use the response's `.json()` method to parse our data, which does the same job as Python's json library. JSON stands for JavaScript Object Notation, and it's the data format that many websites use. A JSON object maps closely onto a dictionary, so you can think of the parsed result as a Python dict that may contain lists and other dicts.

For more details about json, click here: https://docs.python.org/2/library/json.html
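To make the JSON-to-dict mapping concrete, here is a small sketch that parses a JSON string with the standard `json` module. The text in `raw_text` is made up for illustration and only imitates the shape of a response; it is not real NewsAPI output.

```python
import json

# A tiny, made-up response that imitates the shape of the NewsAPI output
raw_text = '{"status": "ok", "articles": [{"title": "Example headline", "author": null}]}'

parsed = json.loads(raw_text)

print(type(parsed).__name__)              # dict
print(type(parsed["articles"]).__name__)  # list
print(parsed["articles"][0]["title"])     # Example headline
print(parsed["articles"][0]["author"])    # None (JSON null becomes Python None)
```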

In [4]:
json_data = request.json() # parse the JSON response body into a Python dict

In [5]:
# Check out some of the headlines returned from the request
for article in json_data["articles"]:
    print(article["title"])

macOS High Sierra’s best features are the ones you don’t see
Blue Apron IPO off to a rough start
Microsoft confirms Cloudyn acquisition, sources say price is between $50M and $70M
Uber ATG upgrades its autonomous truck test fleet with new tech


In [6]:
# Look at the description (a short summary) of the first article
print(json_data["articles"][0]["description"])

The new operating system isn't rife with shiny new features, but it brings enhancements under the hood designed to speed up devices. More importantly, they..
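The lookups above work because the response is a dict whose "articles" key holds a list of dicts. The sketch below walks that structure defensively, assuming (the source does not confirm this) that a field can sometimes be missing. The sample uses two titles from the output above, and the second record has its author removed on purpose to show the fallback. The helper `summarize` is hypothetical.

```python
# Sample data in the same shape as json_data, trimmed to two fields per article
sample_data = {
    "articles": [
        {"title": "Blue Apron IPO off to a rough start", "author": "Katie Roof"},
        {"title": "Uber ATG upgrades its autonomous truck test fleet with new tech"},
    ]
}


def summarize(data):
    """Return (title, author) pairs, substituting a default for missing fields."""
    rows = []
    for article in data.get("articles", []):
        title = article.get("title", "(no title)")
        author = article.get("author") or "(unknown author)"
        rows.append((title, author))
    return rows


for title, author in summarize(sample_data):
    print(f"{author}: {title}")
```

Using `.get` instead of square brackets avoids a `KeyError` when a key is absent.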


## 3. Convert that raw data into something useful

In [7]:
# We can actually convert this data to a dataframe for easier analysis
import pandas as pd

In [8]:
# Convert the list of article dictionaries to a dataframe
news_df = pd.DataFrame(json_data["articles"])

news_df.head()

| | author | description | publishedAt | title | url | urlToImage |
|---|---|---|---|---|---|---|
| 0 | Brian Heater | The new operating system isn't rife with shiny... | 2017-06-29T21:04:01Z | macOS High Sierra’s best features are the ones... | https://techcrunch.com/2017/06/29/macos-high-s... | https://tctechcrunch2011.files.wordpress.com/2... |
| 1 | Katie Roof | Meal delivery business Blue Apron opened for t... | 2017-06-29T15:01:00Z | Blue Apron IPO off to a rough start | https://techcrunch.com/2017/06/29/blue-apron-i... | https://tctechcrunch2011.files.wordpress.com/2... |
| 2 | Ingrid Lunden, Ron Miller | Back in April, we began hearing that Microsoft... | 2017-06-29T14:04:23Z | Microsoft confirms Cloudyn acquisition, source... | https://techcrunch.com/2017/06/29/microsoft-fi... | https://tctechcrunch2011.files.wordpress.com/2... |
| 3 | Darrell Etherington | Uber's Advanced Technologies Group has a new v... | 2017-06-29T13:00:03Z | Uber ATG upgrades its autonomous truck test fl... | https://techcrunch.com/2017/06/29/uber-atg-upg... | https://tctechcrunch2011.files.wordpress.com/2... |
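One possible next step is to parse `publishedAt` into real timestamps so the articles can be sorted or filtered by time. This sketch assumes the column always holds ISO 8601 strings like the ones shown above. The two rows are copied from the earlier output and trimmed to two columns, and the name `sample_df` is used only for this illustration.

```python
import pandas as pd

# Two articles from the output above, trimmed to the fields needed here
articles = [
    {"title": "macOS High Sierra’s best features are the ones you don’t see",
     "publishedAt": "2017-06-29T21:04:01Z"},
    {"title": "Blue Apron IPO off to a rough start",
     "publishedAt": "2017-06-29T15:01:00Z"},
]

sample_df = pd.DataFrame(articles)

# Parse the timestamp strings into timezone-aware (UTC) datetimes
sample_df["publishedAt"] = pd.to_datetime(sample_df["publishedAt"], utc=True)

# Order the articles from oldest to newest
sample_df = sample_df.sort_values("publishedAt").reset_index(drop=True)

print(sample_df[["publishedAt", "title"]])
```

The same two statements should apply to `news_df` unchanged, provided every row has a parseable `publishedAt` value.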


Feel free to continue this analysis in whatever way you see fit. 