# Web Scraping

Websites can be scraped ethically. Before you start, read the site's `/robots.txt` file to see which paths it allows crawlers to fetch, and check whether the site offers an API that you could use instead of scraping.
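As a minimal sketch of the `/robots.txt` check, Python's standard-library `urllib.robotparser` can evaluate a set of rules against a URL. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions made for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://example.com/news"))
print(is_allowed(sample_rules, "https://example.com/private/data"))
```

Against a live site, the same parser can load the rules itself: call `set_url()` with the address of the site's `robots.txt`, then `read()`, before calling `can_fetch()`.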

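A widely used Python library for parsing scraped HTML is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It is a third-party package, not part of the Python standard library, and is paired here with `requests`, which downloads the pages.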

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
url_root = 'https://news.ycombinator.com/news'
# Front page plus pages 2 through 5
urls = [url_root] + [f'{url_root}?p={page}' for page in range(2, 6)]
frames = []
for url in urls:
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(res.text, 'html.parser')
    # These selectors match the page's markup when this was run;
    # update them if the site's HTML changes.
    links = soup.select('.titlelink')   # one title anchor per story
    subtexts = soup.select('.subtext')  # one metadata row per story
    titles = [link.get_text() for link in links]
    hrefs = [link.get('href') for link in links]
    # Rows without a score element count as 0 points
    votes = []
    for subtext in subtexts:
        score = subtext.select_one('.score')
        votes.append(0 if score is None else int(score.get_text().split(' ')[0]))
    df = pd.DataFrame({
        'titles': titles,
        'links': hrefs,
        'points': votes
    })
    # Keep only stories with at least 100 points
    frames.append(df[df.points >= 100])
df_all = pd.concat(frames).sort_values('points', ascending=False).reset_index(drop=True)

In [3]:
df_all

| | titles | links | points |
|---|---|---|---|
| 0 | 55 GiB/s FizzBuzz | https://codegolf.stackexchange.com/questions/2... | 1041 |
| 1 | Facebook Renames to Meta | https://about.facebook.com/meta/ | 1037 |
| 2 | A patent troll backs off | https://www.sparkfun.com/news/3970 | 789 |
| 3 | I was rejected by Codecademy three times, so I... | https://codeamigo.dev?ref=HN | 766 |
| 4 | New MacBook Pro has first ‘DIY-friendly’ batte... | https://www.ifixit.com/News/54122/macbook-pro-... | 692 |
| ... | ... | ... | ... |
| 60 | How spies are caught (2001) | https://www.wrc.noaa.gov/wrso/security_guide/c... | 108 |
| 61 | The M1 Max is the fastest GPU we have ever mea... | https://twitter.com/andysomerfield/status/1451... | 104 |
| 62 | Programmer Moneyball (2016) | http://danluu.com/programmer-moneyball/ | 100 |
| 63 | The 50-year-old P-NP problem that eludes theor... | https://www.technologyreview.com/2021/10/27/10... | 100 |
