# Python, show me the news
Today, we will learn about **web scraping**, which means using programs to gather data from websites. Think about your favorite sports team, the news, or perhaps a product you really want. After today, you will be able to get information about them straight from the internet using Python.

We will use the `requests` library to send HTTP requests to websites and the Beautiful Soup library (the `bs4` package) to parse the HTML source code that those requests return. Before we do anything else, let's `import` these libraries.

In [1]:
import requests
from bs4 import BeautifulSoup

Now let's talk briefly about how web scraping works.

First, webpages are written in a markup language called HTML (HyperText Markup Language), along with other languages that we need not worry about for now.

When we send an HTTP request to a webpage, the site's web server returns the HTML source code of that page. The source code contains the page's text and links, along with references to its images, videos, and other resources.

Next, we hand that source code to Beautiful Soup, which uses an HTML parser to turn it into a structured tree of elements. Then we can use Beautiful Soup to extract the data we want from the parsed tree.
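
The steps above can be sketched offline. This is a minimal illustration rather than a real request: the bytes in `fake_response_content` are a made-up stand-in for what a server would return (both the name and the content are assumptions), so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Step 1 (simulated): a server's reply arrives as raw bytes.
# `fake_response_content` is a hypothetical stand-in for a real response body.
fake_response_content = (
    b"<html><head><title>Hello, scraper</title></head>"
    b"<body><p>Hi!</p></body></html>"
)

# Step 2: parse the HTML into a tree of elements.
soup = BeautifulSoup(fake_response_content, "html.parser")

# Step 3: extract the data we want from the parsed tree.
page_title = soup.find("title").text
print(page_title)
```

In the rest of this lesson, step 1 is replaced by a real call to `requests.get`.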

## HTTP request
Let's take a look at what we get when we send an HTTP request to https://www.google.com/ and print both the response object and its `content`, the raw page source as bytes.

In [2]:
import requests

link = "https://www.google.com/"
response = requests.get(link)
print(response)
print(response.content)

<Response [200]>
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2021/seasonal-holidays-2021-6753651837109324-law.gif" itemprop="image"><meta content="Seasonal Holidays 2021" property="twitter:title"><meta content="Seasonal Holidays 2021 #GoogleDoodle" property="twitter:description"><meta content="Seasonal Holidays 2021 #GoogleDoodle" property="og:description"><meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site"><meta content="https://www.google.com/logos/doodles/2021/seasonal-holidays-2021-6753651837109324-2xa.gif" property="twitter:image"><meta content="https
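
The `200` in `<Response [200]>` is the HTTP status code, and 200 means the request succeeded. Requests can also fail, so a slightly more defensive version may be useful. The sketch below is one possible way to wrap the call, not the only one: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page source as text, or raise an error if the request failed.

    `fetch_html` is a hypothetical helper for illustration, not part of requests.
    """
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    # .text is the decoded string; .content is the raw bytes.
    return response.text
```

Calling `fetch_html("https://www.google.com/")` would return that page's source as a string rather than as bytes.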

## Parsing HTML
Wow! That response is quite a lot of code, and that's just Google's homepage. We definitely need a parser to make sense of it before we can extract any useful information.

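To parse some HTML code, we create a `BeautifulSoup` object, passing it the HTML (as a string or bytes) and the name of the parser to use, such as Python's built-in `"html.parser"`: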
```python
BeautifulSoup(HTML_CODE, PARSER)
```

In [3]:
import requests
from bs4 import BeautifulSoup

link = "https://www.google.com/"
response = requests.get(link)

soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
 <head>
  <meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"/>
  <meta content="noodp" name="robots"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/logos/doodles/2021/seasonal-holidays-2021-6753651837109324-law.gif" itemprop="image"/>
  <meta content="Seasonal Holidays 2021" property="twitter:title"/>
  <meta content="Seasonal Holidays 2021 #GoogleDoodle" property="twitter:description"/>
  <meta content="Seasonal Holidays 2021 #GoogleDoodle" property="og:description"/>
  <meta content="summary_large_image" property="twitter:card"/>
  <meta content="@GoogleDoodles" property="twitter:site"/>
  <meta content="https://www.google.com/logos/doodles/2021/seasonal-holidays-2021-6753651837109324-2xa.gif" property="twitter:image"

## Extracting information from parsed data
Now we want to get information from our parsed data. Before we go there, let's talk a little about HTML.

An HTML document is a series of HTML **elements**. The example below includes elements such as `<html>`, `<body>`, `<h1>`, `<p>`, and `<a>`.

Some of our elements have **attributes**, like `href` and `class`. Another very important attribute that isn't shown below is `id`.

```html
<!DOCTYPE html>
<html>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p class="p1">This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
      <p class="p2"><a href="https://www.iana.org/domains/example">More information...</a></p>
    </div>
  </body>
</html>
```

To get certain information from a web page, we can specify the **tag name** and/or the **attributes** of the element we want. Let's use the HTML above as an example.

In [4]:
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p class="p1">This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
      <p class="p2"><a href="https://www.iana.org/domains/example">More information...</a></p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1"))
print(soup.find("h1").text)
print()

print(soup.find("a"))
print(soup.find("a").get("href"))
print()

print(soup.find("p", attrs={"class": "p1"}).text)
print(soup.find("p", attrs={"class": "p2"}).text)
print()

for p in soup.find_all("p"):
    print("- " + p.text)

<h1>Example Domain</h1>
Example Domain

<a href="https://www.iana.org/domains/example">More information...</a>
https://www.iana.org/domains/example

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...

- This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
- More information...
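
The same lookups can also be written with CSS selectors through Beautiful Soup's `select_one` and `select` methods, or with the `class_` keyword argument. This is an alternative sketch, not a replacement for the approach above; it uses a shortened copy of the example HTML.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h1>Example Domain</h1>
  <p class="p1">This domain is for use in illustrative examples in documents.</p>
  <p class="p2"><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is shorthand for attrs={"class": ...} ("class" is a reserved word in Python)
first_paragraph = soup.find("p", class_="p1").text

# select_one returns the first element matching a CSS selector:
# here, an <a> tag nested inside a <p> with class "p2"
more_info_url = soup.select_one("p.p2 a").get("href")

# select returns a list of every matching element
paragraph_count = len(soup.select("p"))

print(first_paragraph)
print(more_info_url)
print(paragraph_count)
```

Selectors tend to be convenient when the element you want is easiest to describe by where it sits inside other elements.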


Now let's apply this to Google's homepage. Why don't we print out all the links on that page to start off with?

In [5]:
import requests
from bs4 import BeautifulSoup

link = "https://www.google.com/"
response = requests.get(link)

soup = BeautifulSoup(response.content, "html.parser")

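links = soup.find_all("a")
# Use a separate loop variable so we don't overwrite the `link` URL string above
for a_tag in links:
    # .get("href") returns None when an <a> tag has no href attribute
    href = a_tag.get("href") or ""
    print(a_tag.text + ": " + href)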

Images: https://www.google.com/imghp?hl=en&tab=wi
Maps: https://maps.google.com/maps?hl=en&tab=wl
Play: https://play.google.com/?hl=en&tab=w8
YouTube: https://www.youtube.com/?gl=US&tab=w1
News: https://news.google.com/?tab=wn
Gmail: https://mail.google.com/mail/?tab=wm
Drive: https://drive.google.com/?tab=wo
More »: https://www.google.com/intl/en/about/products?tab=wh
Web History: http://www.google.com/history/optout?hl=en
Settings: /preferences?hl=en
Sign in: https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
: /search?ie=UTF-8&q=seasonal+holidays&oi=ddle&ct=199210428&hl=en&sa=X&ved=0ahUKEwiGuoPKs7z0AhXtoFsKHSYvCHoQPQgD
Advanced search: /advanced_search?hl=en&authuser=0
Advertising Programs: /intl/en/ads/
Business Solutions: /services/
About Google: /intl/en/about.html
Privacy: /intl/en/policies/privacy/
Terms: /intl/en/policies/terms/
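
Several of the links above, such as `/preferences?hl=en`, are relative: they only make sense in combination with the address of the page they came from. One way to turn them into full URLs is `urljoin` from Python's standard library, sketched below with a few of the links from the output hard-coded so that it runs offline.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = "https://www.google.com/"

# A few href values copied from the output above
hrefs = [
    "https://news.google.com/?tab=wn",  # already absolute
    "/preferences?hl=en",               # relative to the site root
    "/intl/en/about.html",
]

# urljoin leaves absolute URLs alone and resolves relative ones against base_url
absolute_links = [urljoin(base_url, href) for href in hrefs]
for url in absolute_links:
    print(url)
```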


This is a really simple example of web scraping, and one that is probably useful to nobody. Now let's scrape something useful: the news.

The example below scrapes [Google News](https://news.google.com) for [Python-related news articles](https://news.google.com/topics/CAAqIQgKIhtDQkFTRGdvSUwyMHZNRFY2TVY4U0FtVnVLQUFQAQ). Feel free to scrape headlines about anything else.

In [6]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://news.google.com/topics/CAAqIQgKIhtDQkFTRGdvSUwyMHZNRFY2TVY4U0FtVnVLQUFQAQ")
soup = BeautifulSoup(response.content, "html.parser")

for headline in soup.find_all("a", attrs={"class": "DY5T1d RZIKme"}):
    print(headline.text)

11 Useful Python One-Liners You Must Know
Microsoft’s Pyjion compiler for Python reaches 1.0
Banks use Python weirdly as a programming language. That's ok
Move over Python — Rust is the highest paid programming language of 2021
Become a Python programmer with this online course bundle
Week in review: Windows EoP flaw still exploitable, GoDaddy breach, malicious Python packages on PyPI
Python stands to lose its GIL, and gain a lot of speed
Python ranks as the most popular programming language for the first time in 20 years
Python programming bootcamps guide: Invest in a tech career with the right bootcamp
Master Python Programming for Less Than $30
BFree Brings Intermittent Computing To Python
How to Use Python as a Command-Line Calculator
Programming languages: Faster Python project Pyston takes a big step forward
Python Is One of the Best Programming Languages for Entrepreneurs
Top Programming Languages 2021
Experimenting with Python implementation of Host Identity Protocol
How to get
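
Class names like `DY5T1d RZIKme` are tied to the page's markup at the time of writing, so the loop above may silently print nothing if the site changes. Below is a minimal sketch of a guard against that. It assumes a hypothetical helper, `extract_headlines`, and uses a made-up HTML snippet in place of the live page.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, css_class):
    """Return the text of every <a> tag whose class attribute matches css_class.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a.text for a in soup.find_all("a", attrs={"class": css_class})]


# Made-up markup standing in for a downloaded news page
sample_html = """
<a class="DY5T1d RZIKme" href="./articles/1">First sample headline</a>
<a class="DY5T1d RZIKme" href="./articles/2">Second sample headline</a>
<a class="other" href="./about">About</a>
"""

headlines = extract_headlines(sample_html, "DY5T1d RZIKme")
if not headlines:
    # An empty list usually means the selector no longer matches the page
    print("No headlines found; the page's class names may have changed.")
for headline in headlines:
    print(headline)
```

If the message appears on a real page, inspecting the page source in your browser is usually the quickest way to find the new tag and class names.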

And now, your turn! What are you interested in? If you can find it on a website, you can scrape it!