In [None]:
%pip install -q -r ../requirements.txt

# Web scraping Hacker News

[Hacker News](https://news.ycombinator.com/) is a social news website focused on computer science and entrepreneurship, and a great source of information for developers and tech enthusiasts. It is run by Y Combinator, the investment fund and startup incubator co-founded by Paul Graham.

Let's scrape the first page of the site and extract the following information for each post:

- Title of the post
- Link to the post

We will use the `requests` library and Beautiful Soup (installed as `beautifulsoup4`, imported as `bs4`) to scrape the website.

Let's start by importing the necessary libraries and defining a function to scrape the website.

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

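def scrape_website(url):
    """Download a page and return its parsed HTML."""
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    # Raise an exception on HTTP errors (4xx/5xx) instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup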

We're using `requests` to download the page's HTML and Beautiful Soup to parse that HTML so we can extract the information we need.
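
To see what the parsed object offers without making a network request, here is a minimal sketch that parses a small hand-written HTML string. The string `sample_html` and its contents are made up for illustration; they are not taken from Hacker News.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for this example
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <a href="https://example.com/">world</a>!</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string   # text of the <title> tag
first_link = sample_soup.find("a")      # first <a> tag in the document
intro_text = sample_soup.find("p", class_="intro").get_text()

print(page_title)
print(first_link["href"])
print(intro_text)
```

The same `find` and attribute-access calls work on the object returned by `scrape_website`.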

In [None]:
url = "https://news.ycombinator.com/news"

soup = scrape_website(url)

print(soup.prettify())

We've downloaded the landing page, but we still need to extract the title and link of each post. Let's inspect the HTML to find the element and class name that hold them.

## Example HTML fragment

```html
<span class="titleline">
	<a href="https://flyonui.com/">Show HN: Flyon UI – Free Tailwind Components Library</a>
	<span class="sitebit comhead">
		(<a href="from?site=flyonui.com"><span class="sitestr">flyonui.com</span></a>)
	</span>
</span>
```

The first `<a>` tag contains the link (its `href` attribute) and the title of the post. It has no class of its own, so we first locate its parent `<span>` tag, which has the class `titleline`.

We'll start by finding all the `<span>` tags with the class `titleline`.
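
As a quick check of that approach, the sketch below runs `find_all` on the example fragment above rather than on the live page. The variable names are illustrative, and the CSS selector passed to `select` is an alternative way to reach the same `<a>` tag, not something the rest of this notebook relies on.

```python
from bs4 import BeautifulSoup

# The example fragment from above, stored as a string
fragment = """
<span class="titleline">
  <a href="https://flyonui.com/">Show HN: Flyon UI – Free Tailwind Components Library</a>
  <span class="sitebit comhead">
    (<a href="from?site=flyonui.com"><span class="sitestr">flyonui.com</span></a>)
  </span>
</span>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python
spans = fragment_soup.find_all("span", class_="titleline")

# Alternative: a CSS selector that matches only <a> tags that are direct
# children of a titleline span, which skips the nested site link
title_links = fragment_soup.select("span.titleline > a")

print(len(spans))
print(title_links[0]["href"])
```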

In [None]:
titlelines = soup.find_all('span', class_='titleline')

titlelines

For each of them, we can now find the first `<a>` tag inside it and extract the link and title.
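
Not every `href` is a full URL: the example fragment above also contains the relative link `from?site=flyonui.com`. If relative links turn up in the results, one option is to resolve them against the page URL with `urljoin` from the standard library. This is a sketch with a hypothetical `hrefs` list, not a step the notebook requires.

```python
from urllib.parse import urljoin

base_url = "https://news.ycombinator.com/news"

# Hypothetical sample: one absolute and one relative href from the fragment
hrefs = ["https://flyonui.com/", "from?site=flyonui.com"]

# urljoin leaves absolute URLs unchanged and resolves relative ones
# against base_url
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```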

In [None]:
data = []

for titleline in titlelines:
    link = titleline.find('a')
    if link:
        # Extract titles and links
        data.append({
            'title': link.text,
            'link': link['href']
        })

data

We can build a new pandas `DataFrame` from the extracted information, which makes the data easier to analyze and visualise.
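
As a small illustration of the kind of analysis this enables, the sketch below derives a `domain` column from the links. The records in `sample_data` are hand-written placeholders in the same shape as `data`, and the `domain` column is an assumption added for this example.

```python
from urllib.parse import urlparse

import pandas as pd

# Placeholder records with the same keys as the scraped `data` list
sample_data = [
    {"title": "Show HN: Flyon UI – Free Tailwind Components Library",
     "link": "https://flyonui.com/"},
    {"title": "An example post", "link": "https://example.com/post"},
]

# Each dictionary becomes a row; the keys become the column names
sample_df = pd.DataFrame(sample_data)

# Derive the host name of each link, e.g. to count posts per site later
sample_df["domain"] = sample_df["link"].map(lambda url: urlparse(url).netloc)

print(sample_df[["domain", "title"]])
```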

In [None]:
df = pd.DataFrame(data)

df.head()

## Exercise

1. Modify the code to scrape a different website of your choice
2. Extract additional information (e.g., article date, author, number of comments)
3. Save the results to a CSV file
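
As a starting point for step 3, here is a minimal sketch that writes a `DataFrame` to CSV and reads it back. The file name `posts.csv` and the sample row are placeholders, and a temporary directory is used so that the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder row standing in for the scraped results
sample_df = pd.DataFrame(
    [{"title": "An example post", "link": "https://example.com/post"}]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "posts.csv"  # placeholder file name
    # index=False omits the row index column from the file
    sample_df.to_csv(csv_path, index=False)
    # Read the file back to confirm the round trip
    restored_df = pd.read_csv(csv_path)

print(list(restored_df.columns))
```

In your own solution, replace the temporary path with a file name of your choice and pass the `DataFrame` you built from the scraped data.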