# API vs Web Scraping

- Information can be extracted from a website either by scraping its pages or by calling the site's API, if it has one
    - Working with an API is preferable when one is available
    - Below: a comparison of web scraping vs. API for Hacker News
    
## RAPTOR 
RAPTOR stands for Review - Access - Parse - Transform - Store.
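
The five stages can be read as a small pipeline. The following is a minimal sketch, assuming a canned JSON string stands in for a server response; the function names and the sample record are hypothetical and only illustrate the order of the stages.

```python
import csv
import io
import json

# Review: the keys we expect, as read from the documentation (assumed here)
EXPECTED_KEYS = ("title", "url", "score")

def access():
    """Access: stand-in for an HTTP call; returns a canned JSON string."""
    return '[{"title": "Example story", "url": "https://example.com", "score": 42, "by": "someone"}]'

def parse(raw):
    """Parse: decode the JSON text into Python objects."""
    return json.loads(raw)

def transform(items):
    """Transform: keep only the fields of interest, filling gaps with None."""
    return [{key: item.get(key) for key in EXPECTED_KEYS} for item in items]

def store(rows):
    """Store: write rows as CSV text (an in-memory buffer instead of a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPECTED_KEYS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = transform(parse(access()))
csv_text = store(rows)
print(csv_text)
```

Keeping each stage in its own function makes it easy to swap the Access step between scraping and an API call while the later stages stay the same.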

| |Web Server | Web Server + API|
|:---|:---------|:-----------|
|Review | HTML structure (tags, attributes, etc.) | Parameters and structure from documentation|


### Hacker News Example
[Hacker News](https://news.ycombinator.com/) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."

- It also offers an API providing structured, JSON-formatted results
    - Base URL: https://hacker-news.firebaseio.com/v0
 
- See explanation and documentation at: http://github.com/HackerNews/API

- The new Python 


1. First, let's try to scrape the title, link, and score of every article on https://news.ycombinator.com/.

In [54]:
# source: adapted from Broucke & Baesens (Chp. b9)
import requests 
from bs4 import BeautifulSoup

# articles is a list that will hold info about each article
articles = []

url = 'http://news.ycombinator.com/news'
r = requests.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')

# get all rows in news table
for item in html_soup.find_all('tr', attrs = {'class':'athing'}):
    
    # scrape the title of each news item
    item_title = item.find('td', attrs = {'class':'title'}).find_next_sibling('td', attrs = {'class':'title'}).text
    # find the hyperlink tag of each news item
    item_a = item.find('a', attrs = {'rel':'noreferrer'})
    # extract href attribute from hyperlink tag
    item_link = item_a.get('href') if item_a else None
    # the score is not in this row; it sits in the row right after it
    next_row = item.find_next_sibling('tr')
    item_score = next_row.find('span', attrs = {'class':'score'})
    # items without a score (e.g., job posts) fall back to '0 points'
    item_score = item_score.get_text(strip = True) if item_score else '0 points'
    
    articles.append({"Title":item_title, "Link": item_link, "Score": item_score})
    
# print the article info
for article in articles:
    print(article)

{'Title': 'Software Engineering at Google (abseil.io)', 'Link': 'https://abseil.io/resources/swe-book/html/toc.html', 'Score': '24 points'}
{'Title': 'Analysis of the data job market using HN job posts (emiruz.com)', 'Link': 'https://emiruz.com/post/2023-08-12-data-jobs/', 'Score': '22 points'}
{'Title': 'The 2002 Überlingen midair collision (admiralcloudberg.medium.com)', 'Link': 'https://admiralcloudberg.medium.com/tears-in-the-rain-the-2002-%C3%BCberlingen-midair-collision-591232d0c51e', 'Score': '45 points'}
{'Title': 'How to run a miserable code review (badsoftwareadvice.substack.com)', 'Link': 'https://badsoftwareadvice.substack.com/p/how-to-run-a-miserable-code-review', 'Score': '10 points'}
{'Title': 'Nobody ever paid me for code (bitecode.dev)', 'Link': 'https://www.bitecode.dev/p/nobody-ever-paid-me-for-code', 'Score': '108 points'}
{'Title': 'Writing about what you learn pushes you to understand topics better (addyosmani.com)', 'Link': 'https://addyosmani.com/blog/write-lear
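
The scraped scores are strings such as `'24 points'`, so a Transform step is needed before they can be sorted or summed. A minimal sketch, assuming the score text always starts with an integer; `parse_score` is a hypothetical helper, the first two rows are copied from the output above, and the third row is invented to show the fallback value.

```python
import re

def parse_score(score_text):
    """Return the leading integer of a string like '24 points', or 0 if none is found."""
    match = re.match(r"\s*(\d+)", score_text or "")
    return int(match.group(1)) if match else 0

scraped = [
    {"Title": "Software Engineering at Google (abseil.io)", "Score": "24 points"},
    {"Title": "Nobody ever paid me for code (bitecode.dev)", "Score": "108 points"},
    {"Title": "A hypothetical item without a score", "Score": "0 points"},
]

# Replace each score string with its numeric value
for row in scraped:
    row["Score"] = parse_score(row["Score"])

# Sort by numeric score, highest first
ranked = sorted(scraped, key=lambda row: row["Score"], reverse=True)
print([row["Score"] for row in ranked])
```

The API returns `score` as a number already, which is one practical reason the API route needs less cleanup than scraping.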

2. Now let's use the Hacker News API to fetch the same articles. We send HTTP requests and read the JSON responses returned by the API.

In [66]:
import requests 
articles = []
url = 'https://hacker-news.firebaseio.com/v0'

# request the top stories endpoint described in the official API documentation
top_stories = requests.get(url + '/topstories.json').json()

# see how many IDs we get
print(len(top_stories))
type(top_stories)

500


list

In [69]:
# these are the IDs of the stories
top_stories[:5]

[37121180, 37120874, 37120911, 37120372, 37120967]

In [70]:
for story_id in top_stories[:5]:
    story_url = url + '/item/{}.json'.format(story_id)
    print("Fetching:", story_url)
    
    # make http request to each story URL
    r = requests.get(story_url)
    # your response is json-encoded
    story_dict = r.json()
    # store each story info in a list
    articles.append(story_dict)
    

Fetching: https://hacker-news.firebaseio.com/v0/item/37121180.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120874.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120911.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120372.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120967.json


In [92]:
# display the key information you want
for article in articles[:10]:
    print(article['title'], article['url'], article['score'])

Software Engineering at Google https://abseil.io/resources/swe-book/html/toc.html 39
Analysis of the data job market using HN job posts https://emiruz.com/post/2023-08-12-data-jobs/ 25
How to run a miserable code review https://badsoftwareadvice.substack.com/p/how-to-run-a-miserable-code-review 16
The 2002 Überlingen midair collision https://admiralcloudberg.medium.com/tears-in-the-rain-the-2002-%C3%BCberlingen-midair-collision-591232d0c51e 48
I built a garbage collector for a language that doesn't need one https://claytonwramsey.github.io/2023/08/14/dumpster.html 13
Writing about what you learn pushes you to understand topics better https://addyosmani.com/blog/write-learn/ 354
Nobody ever paid me for code https://www.bitecode.dev/p/nobody-ever-paid-me-for-code 112
Svix (YC W21) Is Hiring a Founding Account Executive (US Remote) https://www.svix.com/careers/ 1
Show HN: Little Rat – Chrome extension monitors network calls of all extensions https://github.com/dnakov/little-rat 17
Inside 
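
Indexing with `article['url']` raises a `KeyError` when an item has no `url` field, which can happen for text-only posts such as Ask HN. A hedged sketch using `dict.get` with fallbacks; `summarize` is a hypothetical helper and the two sample items only mimic the shape of API responses.

```python
def summarize(item):
    """Return (title, url, score) with fallbacks for missing keys."""
    # Fall back to the item's discussion page when there is no external link
    fallback_url = "https://news.ycombinator.com/item?id={}".format(item.get("id"))
    return (
        item.get("title", "(no title)"),
        item.get("url", fallback_url),
        item.get("score", 0),
    )

# Hypothetical items: one with a link, one without
sample_items = [
    {"id": 1, "title": "A story with a link", "url": "https://example.com", "score": 10},
    {"id": 2, "title": "Ask HN: a post without a link", "score": 3},
]

for item in sample_items:
    print(*summarize(item))
```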

### How to retrieve the top 10 stories 
First method: query encoding in API requests

In [95]:
import requests

url10 = 'https://hacker-news.firebaseio.com/v0/topstories.json?limitToFirst=10&orderBy="$key"'
ten_top_stories = requests.get(url10).json()
print(ten_top_stories)

[37121180, 37120982, 37120874, 37120911, 37120967, 37120372, 37119942, 37118883, 37120715, 37108833]


In [96]:
url = 'https://hacker-news.firebaseio.com/v0'
articles = []

for story_id in ten_top_stories:
    
    # build the item URL for each story ID
    story_url = url + '/item/{}.json'.format(story_id)
    print('Fetching: ' + story_url)
    
    r = requests.get(story_url)
    story_dict = r.json()
    
    articles.append(story_dict)

Fetching: https://hacker-news.firebaseio.com/v0/item/37121180.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120982.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120874.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120911.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120967.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120372.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37119942.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37118883.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37120715.json
Fetching: https://hacker-news.firebaseio.com/v0/item/37108833.json


In [99]:
# display the key information you want
for article in articles:
    print(article['title'], article['url'], article['score'])

Software Engineering at Google https://abseil.io/resources/swe-book/html/toc.html 82
Inside The Decline of Stack Exchange https://www.thediff.co/archive/inside-the-decline-of-stack-exchange/ 32
Analysis of the data job market using HN job posts https://emiruz.com/post/2023-08-12-data-jobs/ 39
How to run a miserable code review https://badsoftwareadvice.substack.com/p/how-to-run-a-miserable-code-review 25
I built a garbage collector for a language that doesn’t need one https://claytonwramsey.github.io/2023/08/14/dumpster.html 30
The 2002 Überlingen midair collision https://admiralcloudberg.medium.com/tears-in-the-rain-the-2002-%C3%BCberlingen-midair-collision-591232d0c51e 59
Show HN: Little Rat – Chrome extension monitors network calls of all extensions https://github.com/dnakov/little-rat 31
Writing about what you learn pushes you to understand topics better https://addyosmani.com/blog/write-learn/ 366
Svix (YC W21) Is Hiring a Founding Account Executive (US Remote) https://www.svix.co

The second method is to define a dict and pass it as the `params` argument of the request
   - Together with headers, this allows a more specific request to an API
   - It is recommended when a developer key is needed
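
As a sketch of this second method, the query from the first method can be written as a dict. The block uses only the standard library so that it runs offline and sends nothing; it illustrates how a dict becomes a query string and how headers attach to a request. The `User-Agent` value is a hypothetical placeholder, and the final comment shows the comparable `requests` call.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

base_url = 'https://hacker-news.firebaseio.com/v0/topstories.json'

# Same filters as the hand-written query string, kept as data
params = {
    'orderBy': '"$key"',   # the double quotes are part of the value
    'limitToFirst': 10,
}
# Hypothetical header; this API needs no key, so it is only illustrative
headers = {'User-Agent': 'course-notebook/0.1'}

# Build (but do not send) the request; special characters are percent-encoded
req = Request(base_url + '?' + urlencode(params), headers=headers)
print(req.full_url)
print(req.get_header('User-agent'))  # urllib stores header names with this capitalization

# Decoding the query string recovers the original values
decoded = parse_qs(urlsplit(req.full_url).query)
print(decoded)

# With requests, the same dicts are passed directly (not executed here):
# requests.get(base_url, params=params, headers=headers)
```

Keeping parameters in a dict avoids hand-escaping characters such as `"` and `$`, and makes it easy to change one filter without rebuilding the URL string.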
   
### Specific API requests
 
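- To make API requests more specific, use headers and parameters in the request
     - Headers carry metadata about the request, such as the user agent or an authorization key
     - Parameters act like filters that modify the scope of the request
         - Check the API documentation for the supported names and values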
         
- Reddit API
    - Scrape the news on Reddit
    - With a user-agent header 
    - Is the Reddit API free?
        - Not all apps on Reddit will have to pay. The following conditions, effective as of June 1, enable free access to the data API: Apps that make fewer than 100 queries per minute using OAuth authentication and 10 queries per minute not using OAuth can use the API free of charge ([source](https://www.techtarget.com/whatis/feature/Reddit-pricing-API-charge-explained?Offer=abt_pubpro_AI-Insider)).
    - Reddit API official documentation: https://www.reddit.com/dev/api/.
     
- Dealing with json:
    - `pprint`: The pprint module in Python is a utility module that you can use to print data structures in a readable, pretty way. It's a part of the standard library that's especially useful for debugging code dealing with API requests, large JSON files, and data in general.
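
A small sketch of `pprint` on nested data. The dict below is a hypothetical, trimmed imitation of a listing response (`data` → `children` → `data` → `title`); the `depth` argument hides levels deeper than the one given, which helps when the full response is too long to read.

```python
import pprint

# Hypothetical, shortened stand-in for a JSON response
listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3', 'data': {'title': 'First post', 'score': 12}},
            {'kind': 't3', 'data': {'title': 'Second post', 'score': 7}},
        ],
    },
}

# Full structure, wrapped to a narrow width so the nesting is visible
pprint.pprint(listing, width=60)

# Only the top two levels; deeper containers are abbreviated
overview = pprint.pformat(listing, depth=2)
print(overview)

# Once the hierarchy is known, the path to the titles is easy to write
titles = [child['data']['title'] for child in listing['data']['children']]
print(titles)
```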

1. Let's fetch posts from Reddit with the API.

    - http://www.reddit.com/r/news: The place for news articles about current events in the United States and the rest of the world.
    
    - http://www.reddit.com/r/Baruch: Baruch College's student run subreddit.

Note: this example is adapted from this tutorial on [towardsdatascience](http://towardsdatascience.com/a-beginners-guide-to-accessing-data-withweb-apis-using-python-23d262181467).

In [106]:
# import the packages
import requests, json

payload = {
    'limit': 5,
    # 't' is the time window for /top: hour, day, week, month, year, or all
    't': 'day'}

headers = {
    'User-agent': 'Reddit bot 1.0'}

# pick one endpoint; the alternatives are left commented out
# endpoint = 'http://www.reddit.com/r/news/top.json'
# endpoint = 'http://www.reddit.com/r/funny/top.json'
endpoint = 'http://www.reddit.com/r/Baruch/new.json'

r = requests.get(endpoint, headers = headers, params = payload)
r_json = json.loads(r.content)

# use pprint to inspect the hierarchy of the JSON data
# import pprint
# pprint.pprint(r_json)

# extract elements from json data
for sub in (r_json['data']['children']):
    title = sub['data']['title']
    print(title)

Do I have to apply for TAP in order to apply for the Excelsior Program ?
Does anyone know how long it takes Baruch to get CLEP scores?
Thoughts on Jessica Webster for LIB 3030?
Has anyone had Linda Dukette for Bus 9558? What's the course load for bus 9558 in general?
just made my fall 2023 schedule! any feedback on these courses/professors?


In [133]:
import pprint
# pprint.pprint(r.json()['data']['children'])

### Authentication for News API

News API is a simple HTTP REST (REpresentational State Transfer) API for searching and retrieving live articles from various sources.

REST is a set of architectural constraints for web services; this article describes its components: https://restfulapi.net/.

- Get your secret API key
    - Go to: https://newsapi.org/docs/get-started


- My API key: [*****************************************]()
    - See API_credentials.ipynb for the API ID and secret keys.
    - You should apply for a key of your own.


- Read the terms of service: https://newsapi.org/terms
    - Don't violate the terms of use
    - Attribution: the attribution should preferably be a hyperlink to https://newsapi.org with the text "Powered by News API".
    
    
- Since we are working in Python, we can use the Python client library: https://newsapi.org/docs/client-libraries/python.
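
Rather than pasting the secret key into the notebook, one common option is to read it from an environment variable. A minimal sketch; the variable name `NEWSAPI_KEY` and the helpers `load_api_key` and `mask` are assumptions for illustration, not something News API prescribes.

```python
import os

def load_api_key(var_name='NEWSAPI_KEY'):
    """Read the API key from an environment variable and fail loudly if it is missing."""
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError('Set the {} environment variable first'.format(var_name))
    return key

def mask(key, visible=4):
    """Hide all but the last few characters so the key can be printed safely."""
    return '*' * max(len(key) - visible, 0) + key[-visible:]

# Demo with a placeholder value; in practice the variable is set outside the notebook
os.environ.setdefault('NEWSAPI_KEY', 'not-a-real-key-1234')
newsapi_key = load_api_key()
print(mask(newsapi_key))
```

This keeps the key out of shared notebooks and version control, which is the same concern behind masking the key in these notes.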

#### Practice 1
Now let's use the Python client for News API to fetch news. Before getting to the code, you need to install the `newsapi` package in Python.

- Some users have trouble importing `newsapi`.
- If so, try installing `newsapi-python` and importing the client with `from newsapi.newsapi_client import NewsApiClient`.
- This is recorded in a GitHub issue [here](https://github.com/mattlisiv/newsapi-python/issues/29).

We will use 2 main endpoints in News API:

- Top headlines `/v2/top-headlines`: returns breaking news headlines for countries, categories, and singular publishers. This is perfect for use with news tickers or anywhere you want to use live up-to-date news headlines.


- Everything `/v2/everything`: search every article published by over 80,000 different sources large and small in the last 5 years. This endpoint is ideal for news analysis and article discovery.


- There is also a minor endpoint that can be used to retrieve a small subset of the publishers we can scan:
    - Sources `/v2/top-headlines/sources`: returns information (including name, description, and category) about the most notable sources available for obtaining top headlines from. This list could be piped directly through to your users when showing them some of the options available.
    
Reference: https://newsapi.org/docs/endpoints
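
The endpoint paths above can be combined with a base URL and query parameters. The sketch below only builds the request pieces and sends nothing. The base URL `https://newsapi.org` is inferred from the documentation links, the `Authorization` header follows the raw-request example in Practice 2, and `build_request` and `'YOUR_KEY'` are hypothetical placeholders.

```python
from urllib.parse import urlencode

BASE_URL = 'https://newsapi.org'
ENDPOINTS = {
    'top_headlines': '/v2/top-headlines',
    'everything': '/v2/everything',
    'sources': '/v2/top-headlines/sources',
}

def build_request(endpoint, api_key, **params):
    """Return (url, headers) for an endpoint name; nothing is sent."""
    # Drop parameters left as None so they do not appear in the URL
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = BASE_URL + ENDPOINTS[endpoint] + ('?' + query if query else '')
    headers = {'Authorization': api_key}
    return url, headers

url, headers = build_request('top_headlines', 'YOUR_KEY', q='AI', country='us', category=None)
print(url)
```

The returned pair could then be handed to `requests.get(url, headers=headers)`; the client library wraps this same kind of call behind methods such as `get_top_headlines`.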

In [12]:
# %%cmd 
# pip install newsapi
# pip install newsapi-python

In [42]:
newsapi_key = '*****'

1. Top headlines

In [100]:
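# the client must exist before any endpoint is called
from newsapi.newsapi_client import NewsApiClient
newsapi = NewsApiClient(api_key=newsapi_key)

# /v2/top-headlines/sources
sources = newsapi.get_sources()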

# check sources of the news
sources['sources'][30]

{'id': 'entertainment-weekly',
 'name': 'Entertainment Weekly',
 'description': 'Online version of the print magazine includes entertainment news, interviews, reviews of music, film, TV and books, and a special area for magazine subscribers.',
 'url': 'http://www.ew.com',
 'category': 'entertainment',
 'language': 'en',
 'country': 'us'}

In [132]:
# from newsapi import NewsApiClient
from newsapi.newsapi_client import NewsApiClient

# Initialize
newsapi = NewsApiClient(api_key= newsapi_key)

# /v2/top-headlines
top_headlines = newsapi.get_top_headlines(q='AI',
                                          category='business',
                                          language='en',
                                          country='us')

In [139]:
top_headlines['articles'][0]

{'source': {'id': None, 'name': 'Mediaite'},
 'author': None,
 'title': 'Elon Musk Challenges Mark Zuckerberg to Fight at Facebook Chief’s House Tomorrow, Text Shows - Mediaite',
 'description': "In a screenshotted text sent to his biographer, Musk challenged Zuckerberg to an MMA fight on Monday. And he's even willing to cede the home octagon advantage.",
 'url': 'https://www.mediaite.com/sports/elon-musk-challenges-mark-zuckerberg-to-fight-at-facebook-chiefs-house-tomorrow-text-shows/',
 'urlToImage': 'https://www.mediaite.com/wp-content/uploads/2023/07/Zuckerberg-and-Musk.jpg',
 'publishedAt': '2023-08-13T15:48:00Z',
 'content': None}

2. Everything

In [114]:
# /v2/everything
all_articles = newsapi.get_everything(q='AI',
                                      sources='bbc-news,the-verge',
                                      domains='bbc.co.uk,techcrunch.com',
                                      from_param='2023-08-01',
                                      to='2023-08-13',
                                      language='en',
                                      sort_by='relevancy',
                                      page=2)

In [127]:
all_articles['articles'][89]

{'source': {'id': 'techcrunch', 'name': 'TechCrunch'},
 'author': 'Walter Thompson',
 'title': 'TechCrunch+ Roundup: Creator economy VC survey, B2C fintech growth strategy, web3 demo day | TechCrunch',
 'description': 'TechCrunch+ Roundup: Creator economy VC survey, B2C fintech growth strategy, web3 demo day | TechCrunchtechcrunch.com',
 'url': 'https://techcrunch.com/2023/08/11/creator-economy-vc-survey-b2c-fintech-growth-strategy-web3-demo-day/',
 'urlToImage': 'https://techcrunch.com/wp-content/uploads/2023/08/GettyImages-1586299819.jpg?resize=1200,800',
 'publishedAt': '2023-08-11T16:59:21Z',
 'content': 'There are a million reasons why startups fail, and there are only a few reasons why they succeed.\r\nAll successful startups share the same proof points, such as product-market fit, strong compounded g… [+6250 chars]'}

In [129]:
all_articles['articles'][89]['title']

'TechCrunch+ Roundup: Creator economy VC survey, B2C fintech growth strategy, web3 demo day | TechCrunch'
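
Each article is a nested dict, so a Transform step can flatten it before storage. A sketch using the record shown above with the long fields omitted; `flatten_article` is a hypothetical helper, and the timestamp format is assumed to match the `publishedAt` value in that output.

```python
from datetime import datetime, timezone

def flatten_article(article):
    """Flatten one article dict into a row with a parsed UTC timestamp."""
    published = datetime.strptime(article['publishedAt'], '%Y-%m-%dT%H:%M:%SZ')
    return {
        # 'source' is itself a dict; guard against it being missing or None
        'source': (article.get('source') or {}).get('name'),
        'author': article.get('author'),
        'title': article.get('title'),
        'published': published.replace(tzinfo=timezone.utc),
    }

article = {
    'source': {'id': 'techcrunch', 'name': 'TechCrunch'},
    'author': 'Walter Thompson',
    'title': 'TechCrunch+ Roundup: Creator economy VC survey, B2C fintech growth strategy, web3 demo day | TechCrunch',
    'publishedAt': '2023-08-11T16:59:21Z',
}

row = flatten_article(article)
print(row['source'], row['published'].isoformat())
```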

#### Practice 2
Use the JSON-based REST API directly for data requests, without the client library.

In [150]:
# Ex.4
# source: https://www.geeksforgeeks.org/fetching-top-news-using-news-api/
# BBC news api with authorization header and parameters

import requests 

# headers to store the API key
headers = {'Authorization': newsapi_key}

query_params = {
    "source": "bbc-news",
    "sortBy": "top"}

main_url = "https://newsapi.org/v1/articles"
 
# fetching data in json format
res = requests.get(main_url, headers = headers, params=query_params)
open_bbc_page = res.json()
 
# list of article dicts from the response
articles = open_bbc_page["articles"]
 
# empty list to hold all trending news
results = []
     
for article in articles:
    results.append(article["title"])

# print each headline with its rank and character count
for i, title in enumerate(results, start=1):
    print(i, title, len(title))


1 Maui fire: Search for victims intensifies after 80 deaths 57
2 Watch: Onboard the boat bringing aid to fire devastated Maui 60
3 Hawaii fires: Jason Momoa warns tourists not to visit Maui 58
4 Ukraine war: Three-week-old baby and family among seven killed in Russian shelling 82
5 New Zealand's youth vaping crisis clouds smoke-free future 58
6 Miss Universe organisation cuts Indonesia ties over sex abuse claims 68
7 Ecuador: Thousands of soldiers move gang leader Fito 52
8 Perseid meteor shower lights up night sky 41
9 Can this battery-swapping bike tech unchoke cities? We took a ride 66
10 Watch: Thief suspect plucked from drain hiding spot 51


The same request with the Python client library:

In [152]:
top_headlines = newsapi.get_top_headlines(sources = 'bbc-news')

In [148]:
[article['title'] for article in top_headlines['articles']]

['Watch: Onboard the boat bringing aid to fire devastated Maui',
 "New Zealand's youth vaping crisis clouds smoke-free future",
 'Hawaii fires: Jason Momoa warns tourists not to visit Maui',
 'Ukraine war: Three-week-old baby and family among seven killed in Russian shelling',
 'Miss Universe organisation cuts Indonesia ties over sex abuse claims',
 'Ecuador: Thousands of soldiers move gang leader Fito',
 'Watch: Thief suspect plucked from drain hiding spot',
 'Perseid meteor shower lights up night sky',
 'Maui fire: Search for victims intensifies after 80 deaths',
 'Can this battery-swapping bike tech unchoke cities? We took a ride']
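
The two outputs above contain the same headlines in a different order. A quick way to check this kind of claim is to compare the lists as sets; the sketch uses three titles from each output to keep it short.

```python
# Three titles in the order of the raw REST output
rest_titles = [
    'Maui fire: Search for victims intensifies after 80 deaths',
    'Watch: Onboard the boat bringing aid to fire devastated Maui',
    'Perseid meteor shower lights up night sky',
]
# The same three titles in the order of the client library output
client_titles = [
    'Watch: Onboard the boat bringing aid to fire devastated Maui',
    'Perseid meteor shower lights up night sky',
    'Maui fire: Search for victims intensifies after 80 deaths',
]

same_order = rest_titles == client_titles              # list comparison is order-sensitive
same_content = set(rest_titles) == set(client_titles)  # set comparison ignores order
print(same_order, same_content)
```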