# NZU price data scraping

[CarbonNews](https://www.carbonnews.co.nz/) is described as "New Zealand’s only daily news service covering the carbon markets, climate change, sustainable business and the growth of the low-carbon economy". It is a private business, so to read the published stories in full one must pay a subscription fee. However, some useful information can be gleaned from headlines and short text summaries visible to non-subscribers. 

The [Jarden NZ Market Report](https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report) section lists stories with updates on the price of NZUs traded on [CommTrade](https://www.commtrade.co.nz/). Each story (such as [this](https://www.carbonnews.co.nz/story.asp?storyID=28274) one) reports on the latest "fixing", i.e. spot price, as well as the opening bid and offer prices. Since early 2016, the spot price has been quoted in every story's headline, while the opening bid and offer prices are given in the accompanying summary. For older archived stories, however, the spot price is not always featured in the headline but only in the summary, and the formatting of both the headline and the summary is less consistent.

![CarbonNews story](./CarbonNewsStory.png)

Note that the price history plotted in the image accompanying each story covers only the last six months, whereas the [Jarden NZ Market Report](https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report) section and its [archive](https://www.carbonnews.co.nz/tagarchive.asp?tag=Jarden+NZ+Market+Report) go back years. Hence, we would like to scrape all the available price values, with the corresponding date and source story URL, and save this data to a Comma Separated Values (CSV) file in the following format:
```
"date","price","source"
24-07-2023,47.25,'https://www.carbonnews.co.nz/story.asp?storyID=28263'
25-07-2023,50.00,'https://www.carbonnews.co.nz/story.asp?storyID=28274'
```
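
As a rough illustration of the target file, the sketch below writes rows of this kind using Python's standard `csv` module. The function name `write_prices` and the in-memory buffer are assumptions made for the example, and the module's default quoting is used, so the output omits the quotation marks shown in the sample above.

```python
import csv
import io
from datetime import date


def write_prices(rows, fileobj):
    """Write (date, price, url) rows as CSV, preceded by a header line."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["date", "price", "source"])
    for day, price, url in rows:
        # Dates are written as DD-MM-YYYY and prices with two decimals
        writer.writerow([day.strftime("%d-%m-%Y"), f"{price:.2f}", url])


# The two rows from the sample above
rows = [
    (date(2023, 7, 24), 47.25, "https://www.carbonnews.co.nz/story.asp?storyID=28263"),
    (date(2023, 7, 25), 50.00, "https://www.carbonnews.co.nz/story.asp?storyID=28274"),
]

buffer = io.StringIO()  # stand-in for an open file
write_prices(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the buffer could be swapped for a file opened with `open(path, "w", newline="")`.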

In general, data scraping involves parsing the HTML source code of a web page of interest, and in the present case we have two to contend with: [Jarden NZ Market Report](https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report) and its [archive](https://www.carbonnews.co.nz/tagarchive.asp?tag=Jarden+NZ+Market+Report). These two web pages contain all the information required to produce the desired CSV file, but only as far back as February 2016. For hundreds of older stories, to get the price value we will need to find and parse the web page for each story individually. 

## Inspection using browser
From simply looking at the [Jarden NZ Market Report](https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report) web page in a browser we see a listing of stories, each with a clickable headline, a brief text summary, and a graph. Most (if not all) of the headlines state the spot price of NZUs, and each accompanying summary begins with the date when the story was published. Somewhat inconveniently, the date formatting is variable: showing just "Today" or the appropriate weekday for stories that are less than a week old, and the date in full (e.g. "25 Jul 23") only for older stories.  

![](./Report_h1h2.png)
![](./Report_h2h3.png)

Inspecting the HTML source reveals that the central listing of stories is associated with a `<div>` element of class `"StoryList"`. Inside this element, the latest headline in the listing is associated with the element tagged by `<h1>`, the following six headlines are each tagged by `<h2>`, and the remaining ones by `<h3>`; and all these elements have the same class name `"Headline"` attributed to them. The actual headline text is nested inside an `<a>` sub-element with an `href` attribute (defining a hyperlink to the full story). Furthermore, each and every headline element is followed by an accompanying `<p>` element containing the story's brief summary.       
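
The structure described above can be tried out on a small, made-up fragment before touching the real page. In this sketch the HTML string is a hand-written miniature (the summaries are placeholders, and the headline values are borrowed from the sample CSV), and it assumes each summary `<p>` is a direct sibling of its headline element, which may not hold exactly on the live site.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the listing; the summaries are placeholders.
SAMPLE_HTML = """
<div class="StoryList">
  <h1 class="Headline"><a href="story.asp?storyID=28274">MARKET LATEST: NZUs $50.00</a></h1>
  <p>Today - placeholder summary</p>
  <h2 class="Headline"><a href="story.asp?storyID=28263">MARKET LATEST: NZUs $47.25</a></h2>
  <p class="StoryIntro">Monday - placeholder summary</p>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
story_list = sample_soup.find("div", class_="StoryList")

stories = []
for heading in story_list.find_all(["h1", "h2", "h3"], class_="Headline"):
    link = heading.find("a")                  # headline text and hyperlink
    summary = heading.find_next_sibling("p")  # the summary that follows
    stories.append({
        "headline": link.get_text(strip=True),
        "href": link.get("href"),
        "summary": summary.get_text(strip=True),
    })

for story in stories:
    print(story)
```

Pairing each headline with its own following `<p>` avoids relying on two separately built lists staying aligned by index.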

Inspection of [Jarden NZ Market Report Archive](https://www.carbonnews.co.nz/tagarchive.asp?tag=Jarden+NZ+Market+Report) shows a continuation of the same general pattern: the first twenty stories in the archive are associated with `<h3>` elements (containing the headline) and accompanying `<p>` elements (containing the summary), while all older stories are tagged by `<h4>` (without any accompanying `<p>` elements). For these older archived stories, the full date is embedded in the corresponding `<h4>` element but outside the internal `<a>` sub-element.

![](./Archive_h3.png)
![](./Archive_h3h4.png)

Now, having gleaned the underlying HTML structure, we can proceed with the actual data scraping. 

## Scraping with Beautiful Soup

To scrape the data using the Python package Beautiful Soup, we first need to fetch the entire HTML source of each web page of interest using the third-party `requests` library. We are interested in two URLs:
```
https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report
https://www.carbonnews.co.nz/tagarchive.asp?tag=Jarden+NZ+Market+Report
```
So let us get and parse them separately, and store them as two beautiful soups.

In [2]:
import requests
from bs4 import BeautifulSoup

url="https://www.carbonnews.co.nz/"

page = requests.get(url+"tag.asp?tag=Jarden+NZ+Market+Report")
soup = BeautifulSoup(page.content, "html.parser")

page = requests.get(url+"tagarchive.asp?tag=Jarden+NZ+Market+Report")
soup2 = BeautifulSoup(page.content, "html.parser") 

Using our knowledge of the underlying HTML structure, we can pick out the headline and summary elements for each story in the soups, and store the extracted elements in lists.

In [3]:
# Find all the h1, h2, and h3 headlines in first soup
helements = soup.find_all(["h1","h2","h3"], class_="Headline")
# Find the h3 headlines in second soup
helements+= soup2.find_all("h3", class_="Headline")

# Find all the accompanying summaries
pelements = soup.find_all("p", class_=None ) # the h1 headline
pelements+= soup.find_all("p", class_=["StoryIntro","StoryIntro_small"]) # h2 and h3 headlines
pelements+= soup2.find_all("p", class_="StoryIntro_small") # h3 headlines from the second soup

# Find all h4 headlines in the second soup
helements_arch = soup2.find_all(["h4"], class_="Headline")

Note that `helements` and `pelements` should be of equal length, while `helements_arch` should be much longer.

In [4]:
print(len(helements), len(pelements))
print(len(helements_arch))
print(helements_arch[0])

39 39
2760
<h4 class="Headline" xstyle="font-weight:normal;"><img alt="" height="8" src="images/arrow.gif" width="8"> 9 Jun 23  <a href="story.asp?storyID=27911">Failure is not an auction</a></img></h4>


Let us now extract just the three relevant bits of text for each and every story: the headline string, the date string, and the href string.

In [6]:
headlines = []; datestrings = []; hrefs = []

# First loop over more recent headlines with summaries
for i in range(len(helements)):
    body = helements[i].find("a")
    headlines.append(body.text.strip())
    hrefs.append(body.get("href"))
    # The date string is the first text before the ' - ' in the summary.
    datestrings.append(pelements[i].text.strip().split(' - ')[0])
        
# Then loop over older archived headlines without summaries
for h in helements_arch:
    body = h.find("a")
    headline = body.text.strip()
    headlines.append(headline)
    hrefs.append(body.get("href"))
    # The date string is the text that's not part of the headline 
    datestrings.append(h.text.strip().replace(headline,''))

Check that the three lists are of the same length, and then print the first few and the last few entries in each list, just for illustration.

In [13]:
print(len(headlines), len(datestrings), len(hrefs),'\n')

for i in range(8):
    print(datestrings[i],' | ',headlines[i],' | ',hrefs[i])

print()

for i in range(1,9):
    print(datestrings[-i],' | ',headlines[-i],' | ',hrefs[-i])

2799 2799 2799 

Friday  |  MARKET LATEST: NZUs $58.00  |  story.asp?storyID=28371
Thursday  |  MARKET LATEST: NZUs $57.00  |  story.asp?storyID=28360
Wednesday  |  MARKET LATEST: NZUs $57.50  |  story.asp?storyID=28345
Tuesday  |  MARKET LATEST: NZUs $59.75  |  story.asp?storyID=28334
Monday  |  MARKET LATEST: NZUs $60.00  |  story.asp?storyID=28323
28 Jul 23  |  MARKET LATEST: NZUs $61.50  |  story.asp?storyID=28310
27 Jul 23  |  MARKET LATEST: NZUs $65.25  |  story.asp?storyID=28298
26 Jul 23  |  MARKET LATEST: NZUs $65.00  |  story.asp?storyID=28285

13 Jun 08    |  Current carbon credits available  |  story.asp?storyID=1194
13 Jun 08    |  Latest strip of CERs   2008  2012 vintage  indicative mid prices  |  story.asp?storyID=1193
17 Jun 08    |  Current Carbon Credits Available  |  story.asp?storyID=1219
17 Jun 08    |  Latest Strip of CERs   2008  2012 Vintage  Indicative Mid Prices  |  story.asp?storyID=1218
17 Jun 08    |  Oil pushes carbon higher  |  story.asp?storyID=12
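
The scraped `href` values are relative, whereas the CSV file should hold full story URLs. One way to convert them, sketched here with the standard library's `urljoin`, is shown below; the names `BASE_URL` and `sample_hrefs` are chosen for the example only.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.carbonnews.co.nz/"

# Two relative links taken from the listing printed above
sample_hrefs = ["story.asp?storyID=28371", "story.asp?storyID=28360"]

# Resolve each relative link against the site root
urls = [urljoin(BASE_URL, href) for href in sample_hrefs]

for url in urls:
    print(url)
```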

The date is not included in the headlines themselves, but it can still be scraped from the two main web pages, although the date formatting there is more awkward to parse than on the individual story pages.
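
A minimal sketch of how the three date styles seen on the listing pages ("Today", a weekday name, or a full date such as "28 Jul 23") might be normalised is given below. The function name `normalise_date` is hypothetical, the scrape date must be supplied by the caller, and a weekday name is assumed to refer to the most recent such day strictly before the scrape date.

```python
from datetime import date, datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def normalise_date(datestring, today):
    """Convert a listing date string into a datetime.date."""
    text = datestring.strip()
    if text == "Today":
        return today
    if text in WEEKDAYS:
        days_back = (today.weekday() - WEEKDAYS.index(text)) % 7
        # Assumption: a weekday name never means today itself,
        # so a difference of zero is read as one week ago.
        return today - timedelta(days=days_back or 7)
    # Full dates look like "28 Jul 23" (English month abbreviations)
    return datetime.strptime(text, "%d %b %y").date()


scrape_day = date(2023, 8, 5)  # hypothetical scrape date (a Saturday)
for sample in ["Today", "Friday", "Monday", "28 Jul 23", "13 Jun 08   "]:
    print(repr(sample), "->", normalise_date(sample, scrape_day))
```

Because weekday names are only meaningful relative to the day of scraping, recording the scrape date alongside the raw strings makes the conversion reproducible later.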

So, for demonstration purposes, let us take a multi-pronged approach with some redundancy, using two different toolsets:
- [Bash](https://www.gnu.org/software/bash/) and Linux command-line tools to scrape just the spot price from the headlines; and
- the Python package [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to get the spot prices from the headings, and then also to look for the headline price in the story's first paragraph.


## Command-line approach

We are interested in two URLs:
```
https://www.carbonnews.co.nz/tag.asp?tag=Jarden+NZ+Market+Report
https://www.carbonnews.co.nz/tagarchive.asp?tag=Jarden+NZ+Market+Report
```

First download the HTML source using `wget`, then use `grep`, `sed`, and `awk` to get the values.

## Beautiful Soup approach

All CarbonNews stories can be accessed via public URLs such as 
```
https://www.carbonnews.co.nz/story.asp?storyID=28274
```
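for the story pictured above. Given such a URL, we can obtain the page HTML code and scrape the required information from that. Conveniently, the HTML source follows a fairly simple structure, so the heading, the text snippet, and the date can each be extracted with a few lines of code. The helper `check` below tests whether a headline looks like a price report, and `parse_stories` fetches and parses each story page in turn: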

In [4]:
def check(headline):
    """Return True if the headline looks like an NZU spot price report."""
    return ('MARKET LATEST:' in headline or
            ('NZU' in headline and headline.count('$') == 1))
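
Once a headline passes a filter of this kind, the price itself still has to be pulled out of the text. The sketch below does so with a regular expression; `extract_price` and `PRICE_PATTERN` are names invented for this example, and the pattern assumes plain amounts such as `$58.00` without thousands separators.

```python
import re

# A dollar sign followed by digits and an optional decimal part
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)")


def extract_price(headline):
    """Return the single dollar amount in a headline as a float, else None."""
    matches = PRICE_PATTERN.findall(headline)
    if len(matches) != 1:
        # No price, or an ambiguous headline with several prices
        return None
    return float(matches[0])


for text in ["MARKET LATEST: NZUs $58.00",
             "Failure is not an auction",
             "Oil pushes carbon higher"]:
    print(text, "->", extract_price(text))
```

Returning `None` for ambiguous headlines leaves the decision to the caller, for example falling back to parsing the individual story page.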

In [8]:
def parse_stories(hrefs):
    
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    
    data = []
    url_home="https://www.carbonnews.co.nz/"
    
    for href in hrefs:
        
        url = url_home+href
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        
        # Process the heading, which should be the first 'h1' element in soup
        h1elements = soup.find_all("h1", class_="story")
        heading = h1elements[0].text.strip()
        prices = [word.strip('$.') for word in heading.split() if word[0] == '$']
        if len(prices) == 1:
            price = prices[0]
        else:
            price = 'NaN'
            # Count the matches in the list (not the 'NaN' string) to pick the warning
            if len(prices) > 1:
                print('WARNING: Heading contains multiple dollar values.')
            else:
                print('WARNING: Heading contains no dollar values.')
            
        
        # Process text snippet, which should be the only div element of class StoryFirstPara in soup
        para = soup.find("div", class_="StoryFirstPara").text.strip()
        prices = [word.strip('$.') for word in para.split() if word[0] == '$']
        if price not in prices:
            print('WARNING: Heading price not in story text snippet.')
        
        # Process the story date
        h4elements = soup.find_all("h4")
        date = h4elements[0].text.strip()        
        if date.split()[0] == 'Today':
            date = datetime.today().strftime('%Y-%m-%d')
        else:
            date = ' '.join(date.split()[1:-1]) # strip weekday and time
            date = datetime.strptime(date,'%d %b %y').strftime('%Y-%m-%d')
        
        data.append({'date':date, 'price':price, 'url':url})
    
    return data