# Google News Web Scraping 
### Extracting Google News results using Python with the requests and parsel libraries.

In [9]:
# Import libraries:
import requests
import json
import re
from parsel import Selector

|LIBRARY	|     PURPOSE                |
|:----------|:---------------------------|
|requests   |Makes the HTTP request to the website.|
|json       |Serializes the extracted data to a JSON string.|
|re         |Extracts parts of the data with regular expressions.|
|parsel     |Parses data from HTML/XML documents. Similar to BeautifulSoup, but with XPath support.|

In [10]:
# Create request headers and URL parameters:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}

params = {
    "q": "climate change",  # search query
    "hl": "en",              # language of the search
    "gl": "uk",              # country of the search
    "num": "100",            # number of search results per page
    "tbm": "nws"             # news results
}

|CODE	     | EXPLANATION                          |
|:-----------|:-------------------------------------|
|params      |A cleaner way of passing URL query parameters to a request.|
|User-Agent  |Makes the request look like it comes from a real browser when passed in the request headers. The default requests user-agent is python-requests, so a website may recognize a bot or script and block the request. You can check what your own user-agent is.|
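
To make the role of params more concrete, here is a minimal sketch of the URL those parameters turn into. It uses the standard library's urllib.parse.urlencode as a stand-in; requests builds its query string in a similar way, but this is an approximation for illustration, not its exact code path.

```python
from urllib.parse import urlencode

# Same parameters as in the cell above.
params = {
    "q": "climate change",
    "hl": "en",
    "gl": "uk",
    "num": "100",
    "tbm": "nws",
}

# urlencode() escapes each value and joins the key=value pairs with "&".
query_string = urlencode(params)
url = f"https://www.google.com/search?{query_string}"

print(url)
# https://www.google.com/search?q=climate+change&hl=en&gl=uk&num=100&tbm=nws
```

Note that the space in the search query is escaped as "+", which is why passing a dictionary is less error-prone than assembling the URL by hand.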

Make the request with the parameters and headers created above, then pass the returned HTML to parsel:

In [11]:
html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
selector = Selector(text=html.text)

|CODE	      |EXPLANATION                             |
|:------------|:---------------------------------------|
|timeout=30   |Stops waiting for a response after 30 seconds.|
|Selector(text=html.text) |Passes the HTML from the response to parsel for processing.|

Create an empty list to store extracted news results:

In [12]:
news_results = []

Create a variable that holds the text of all < script > tags from the page:

In [14]:
all_script_tags = selector.css("script::text").getall()

|CODE	    |EXPLANATION                                                       |
|:----------|:-----------------------------------------------------------------|
|css()      |A parsel method that extracts nodes matching the given CSS selector(s).|
|::text     |A parsel-specific pseudo-element that extracts text content. parsel translates every CSS query to XPath, so ::text becomes /text() when written as XPath directly.|
|getall()   |Returns a list of all matches as strings.|

Iterate over the news results and extract the thumbnail data (skip to the next step if you don't want thumbnails):

In [15]:
for result, thumbnail_id in zip(selector.css(".xuvV6b"), selector.css(".FAkayc img::attr(id)")):
    thumbnails = re.findall(r"s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=thumbnail_id.get()), str(all_script_tags))

    decoded_thumbnail = "".join([
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
    ])

|CODE	        |EXPLANATION                      |
|:--------------|:--------------------------------|
|zip()          |Iterates over several iterables in parallel. Here it pairs each news result with the id of its thumbnail, whose image data is located in the < script > tags.|
|::attr(id)     |A parsel-specific pseudo-element that extracts the given attribute from an HTML node.|
|re.findall()   |Matches parts of the data using a regular expression pattern and returns a list of matches. Here it matches thumbnails. If you parse thumbnails directly from the HTML, you get a 1x1 placeholder image, not the thumbnail.|
|format(_id=thumbnail_id.get())| Python string formatting that inserts the passed value into the string's placeholder, which is _id in this case: \['{_id}'\];|
|str(all_script_tags) |Casts the returned list to a string.|
|"".join()      |Joins all items into a single string. The list comprehension returns a list of processed elements, such as [thumbnail_1, thumbnail_2, thumbnail_3], or [] if nothing matched; join converts that list to a str.|
|bytes(img, "ascii").decode("unicode-escape")| Decodes the escaped image data.|
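
The regex match and the double decoding can be tried offline. The sketch below runs the same pattern against a made-up script string; the sample text, the id dimg_1 and the placeholder image data are all assumptions for illustration, and the real page markup may differ or change. It also suggests why two decode passes are used: str() on a list produces its repr, which doubles every backslash.

```python
import re

# Hypothetical stand-in for the text of one inline <script> tag. The real
# markup is not shown in this notebook and may differ or change.
script_text = r"(function(){var s='data:image/jpeg;base64,AAAA\x3d\x3d';var ii=['dimg_1'];})();"
all_script_tags = [script_text]
thumbnail_id = "dimg_1"  # hypothetical id, as ::attr(id) would return it

# str() on a list yields its repr, so every backslash is doubled.
haystack = str(all_script_tags)

pattern = r"s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=thumbnail_id)
thumbnails = re.findall(pattern, haystack)

# The first decode collapses the doubled backslashes; the second turns the
# remaining \x3d escapes into "=" characters.
decoded_thumbnail = "".join(
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    for img in thumbnails
)

print(decoded_thumbnail)
# data:image/jpeg;base64,AAAA==
```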

Still inside the loop, append each extracted result to the news_results list as a dict:

In [21]:
    news_results.append(
        {
            "title": result.css(".MBeuO::text").get(),
            "link": result.css("a.WlydOe::attr(href)").get(),
            "source": result.css(".NUnG9d span::text").get(),
            "snippet": result.css(".GI74Re::text").get(),
            "date_published": result.css(".ZE0LJd span::text").get(),
            "thumbnail": None if decoded_thumbnail == "" else decoded_thumbnail
        }
    )
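
One thing to watch when results are paired with thumbnail ids through zip(): it stops at the shorter input. The sketch below uses made-up lists to show the effect, with itertools.zip_longest as one possible way to keep every result. Whether Google News pages actually contain results without thumbnails is an assumption here, not something this notebook verifies.

```python
from itertools import zip_longest

# Hypothetical data: three results, but only two thumbnail ids were found.
results = ["result A", "result B", "result C"]
thumbnail_ids = ["dimg_1", "dimg_3"]

# zip() stops at the shorter input, so "result C" is silently dropped and
# "result B" is paired with an id that may belong to another result.
paired = list(zip(results, thumbnail_ids))

# zip_longest() keeps every result and fills the missing ids with None.
padded = list(zip_longest(results, thumbnail_ids, fillvalue=None))

print(paired)
print(padded)
```

zip_longest() prevents results from being lost, but it does not repair a mismatched pairing. Looking up the image id from inside each result node would be more robust, provided the image sits inside that node in the page markup.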

Print extracted data:

In [27]:
print(json.dumps(news_results, indent=2, ensure_ascii=False))
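
Inline thumbnails can make the printed JSON very large, so writing it to a file may be more practical than printing it in the notebook. The following is a minimal sketch with a made-up one-item list and a temporary file path, both of which are assumptions for illustration. It also shows what ensure_ascii=False changes.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample in the same shape as news_results.
news_results = [
    {"title": "Météo extrême en Europe", "thumbnail": None},
]

# ensure_ascii=False keeps "é" readable; the default escapes it as \u00e9.
readable = json.dumps(news_results, indent=2, ensure_ascii=False)
escaped = json.dumps(news_results, indent=2)

# Write to a file instead of printing; this path is only a temporary one.
output_path = Path(tempfile.gettempdir()) / "news_results.json"
output_path.write_text(readable, encoding="utf-8")

print(f"Saved {len(news_results)} results to {output_path}")
```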


#### Full Pagination Code
The only differences are the additions of:

- a while loop to iterate over all pages.
- an if statement to check for the presence of the "Next" button.
- an increment of the params["start"] parameter by 10 to move to the next page.
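
The control flow described above can be sketched without any network access. In the sketch below, fetch_page and FAKE_PAGES are hypothetical stand-ins for the HTTP request and the "Next" check; only the loop structure mirrors the full code.

```python
# Hypothetical stand-in for the HTTP request: maps a "start" offset to a
# page of results plus a flag saying whether a "Next" link is present.
FAKE_PAGES = {
    0: (["story 1", "story 2"], True),
    10: (["story 3", "story 4"], True),
    20: (["story 5"], False),
}


def fetch_page(start):
    """Return (results, has_next) for the given offset from the fake data."""
    return FAKE_PAGES[start]


params = {"start": 0}
news_results = []
page_num = 0

while True:
    page_num += 1
    results, has_next = fetch_page(params["start"])
    news_results.extend(results)

    if has_next:
        params["start"] += 10  # move to the next page
    else:
        break  # no "Next" link, so this was the last page

print(page_num, news_results)
# 3 ['story 1', 'story 2', 'story 3', 'story 4', 'story 5']
```

In a real scraper, a cap on the number of pages and a short pause between requests would be reasonable additions to this loop.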

In [23]:
import requests, json, re
from parsel import Selector

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}

params = {
    "q": "gta san andreas",  # search query
    "hl": "en",              # language of the search
    "gl": "us",              # country of the search
    "num": "100",            # number of search results per page
    "tbm": "nws",            # news results
    "start": 0               # result offset; 0 is the first page
}

news_results = []
page_num = 0

while True:
    page_num += 1
    print(page_num)

    html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
    selector = Selector(text=html.text)

    # extract thumbnails
    all_script_tags = selector.css("script::text").getall()

    for result, thumbnail_id in zip(selector.css(".xuvV6b"), selector.css(".FAkayc img::attr(id)")):
        thumbnails = re.findall(r"s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=thumbnail_id.get()), str(all_script_tags))

        decoded_thumbnail = "".join([
            bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
        ])

        news_results.append(
            {
                "title": result.css(".MBeuO::text").get(),
                "link": result.css("a.WlydOe::attr(href)").get(),
                "source": result.css(".NUnG9d span::text").get(),
                "snippet": result.css(".GI74Re::text").get(),
                "date_published": result.css(".ZE0LJd span::text").get(),
                "thumbnail": None if decoded_thumbnail == "" else decoded_thumbnail
            }
        )

    if selector.css(".d6cvqb a[id=pnnext]").get():
        params["start"] += 10
    else:
        break 

print(json.dumps(news_results, indent=2, ensure_ascii=False))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15


