<a href="https://colab.research.google.com/github/freezingMonkeys/freezingMonkeysPythonTrack/blob/main/web_scraping_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping
Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites. In this module, we will scrape basic information using the requests library together with two common data formats, JSON and XML.

---
# Extracting data through JSON:


First, we import the libraries: requests to fetch the page and json to parse it.

In [None]:
import requests
import json

Then, we set the URL of the API we want to scrape. Here, we will use the Hong Kong airport's list of past arrivals as an example; the date we want is passed as a query parameter.

In [None]:
date = input("Input date (yyyy-mm-dd): ")
url = f"https://www.hongkongairport.com/flightinfo-rest/rest/flights/past?date={date}&arrival=true&lang=en&cargo=false"
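The URL above trusts whatever the user types. As a minimal sketch, the date could be validated before it goes into the query string. The helper name `build_flights_url` is hypothetical, and the query parameters are simply copied from the cell above.

```python
from datetime import datetime
from urllib.parse import urlencode

# Endpoint copied from the cell above; the helper name is hypothetical.
BASE_URL = "https://www.hongkongairport.com/flightinfo-rest/rest/flights/past"

def build_flights_url(date_text):
    """Return the arrivals URL for date_text, which must look like yyyy-mm-dd."""
    # strptime raises ValueError on malformed or impossible dates.
    day = datetime.strptime(date_text.strip(), "%Y-%m-%d").date()
    query = urlencode({
        "date": day.isoformat(),
        "arrival": "true",
        "lang": "en",
        "cargo": "false",
    })
    return f"{BASE_URL}?{query}"

print(build_flights_url("2024-01-15"))
```

Building the query with `urlencode` also escapes any characters that are not safe inside a URL.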

Use requests to get the content of the page.

In [None]:
json_page = requests.get(url).content # .content is the raw response body (type: bytes)

Use json.loads to parse json_page into Python objects.

In [None]:
flight_list = json.loads(json_page)
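To see what `json.loads` does without calling the API, here is a small sketch that parses a hand-written bytes payload. The payload is made up and only mimics the nesting used later in this notebook (a list holding one dictionary with a `list` key); the values are placeholders, not real flight data.

```python
import json

# Made-up payload that only mimics the nesting used in this notebook;
# the values are placeholders, not real flight data.
sample_bytes = (
    b'[{"list": [{"flight": [{"no": "XX 123", "airline": "XXX"}],'
    b' "origin": ["AAA"], "status": "Landed"}]}]'
)

# json.loads accepts bytes as well as str and decodes them for us.
sample_list = json.loads(sample_bytes)

print(type(sample_list).__name__)     # outer container
print(type(sample_list[0]).__name__)  # its single element
print(list(sample_list[0].keys()))    # keys of that element
```

A requests response also has a `.json()` method, which parses the body in one step and could replace the two cells above.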

Afterwards, we can treat the result like a normal Python list.

In [None]:
print(type(flight_list))
len(flight_list)

The type is list, so we can extract the first (and only) element and inspect its format.

In [None]:
type(flight_list[0])

In [None]:
# Since it is stored as a dictionary, we can process the data like a normal dictionary

# Example:
d = flight_list[0]
all_flights = d['list']

def print_flights(flights):
  # Print the first airline and flight number, the first origin, and the status of each entry
  for entry in flights:
    first_flight = entry["flight"][0]
    print(first_flight["airline"])
    print(first_flight["no"])
    print(entry["origin"][0])
    print(entry["status"])
    print()

print_flights(all_flights)
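Printing is handy for a first look, but collecting the fields into a list of dictionaries makes them easier to reuse. The sketch below assumes that a field may occasionally be missing or empty, which is an assumption about the API and not something shown above. The helper name `flight_rows` and the sample entries are hypothetical placeholders.

```python
def flight_rows(flights):
    """Return one dict per entry, using None where a field is missing."""
    rows = []
    for entry in flights:
        # Fall back to an empty dict / None when the lists are absent or empty.
        first_flight = (entry.get("flight") or [{}])[0]
        origins = entry.get("origin") or [None]
        rows.append({
            "airline": first_flight.get("airline"),
            "no": first_flight.get("no"),
            "origin": origins[0],
            "status": entry.get("status"),
        })
    return rows

# Placeholder entries, not real flight data.
sample_flights = [
    {"flight": [{"no": "XX 123", "airline": "XXX"}], "origin": ["AAA"], "status": "Landed"},
    {"flight": [], "origin": [], "status": "Cancelled"},
]

for row in flight_rows(sample_flights):
    print(row)
```

With real data, `flight_rows(all_flights)` would give a list that can be filtered, sorted, or written to a file.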



---


# Extracting data through XML and BeautifulSoup:

First, we import the libraries

In [None]:
import requests
from bs4 import BeautifulSoup as soup # this imports BeautifulSoup with the shortened name soup

Same steps as before: use requests to get the page content. Here, we will use the Google News RSS feed. The feed is delivered as XML rather than JSON, so we cannot load it with the json module and have to parse the raw markup instead. This is where BeautifulSoup comes in handy.

In [None]:
url = 'https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en'
xml_page = requests.get(url).content

Then, we use BeautifulSoup to process the scraped data.

In [None]:
soup_page = soup(xml_page, "xml") # the "xml" parser requires the lxml package to be installed

soup_page

By printing the page, we find that each news story is stored in its own item tag. Since we want to create a list of news stories, we just have to extract all of the item tags.

In [None]:
news_list = soup_page.find_all('item') # find_all is the current name; findAll is a legacy alias

Afterwards, we can process each item as shown below (see the comments in the code for details).

In [None]:
for news in news_list:
  # to get a child tag of the item -> news.tag_name
  # .text grabs the text between the opening and closing tags
  print(news.title.text)
  print(news.link.text)
  # .get() reads an attribute of a tag, here the url attribute of <source>
  print(news.source.get('url'))
  print(news.pubDate.text)
  print()
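For comparison, the same item fields can be pulled out with the standard library alone. The sketch below swaps BeautifulSoup for `xml.etree.ElementTree` and also converts the `pubDate` text into a `datetime` with `email.utils.parsedate_to_datetime`. The feed text is a made-up placeholder that mimics the item layout described above, and the helper name `parse_items` is hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up feed that mimics the item layout described above; not real news.
sample_rss = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Placeholder headline</title>
      <link>https://example.com/story</link>
      <pubDate>Mon, 15 Jan 2024 08:30:00 GMT</pubDate>
      <source url="https://example.com">Example Source</source>
    </item>
  </channel>
</rss>"""

def parse_items(xml_text):
    """Return one dict per <item> with title, link, source URL and date."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        source = item.find("source")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # An Element with no children is falsy, so compare against None.
            "source_url": source.get("url") if source is not None else None,
            # RSS dates use the e-mail date format, so the email parser applies.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items

for entry in parse_items(sample_rss):
    print(entry["title"], entry["published"].date())
```

Having `published` as a `datetime` makes it possible to sort or filter stories by time, which the raw `pubDate` string does not allow directly.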