# Web Scraping using BeautifulSoup

The aim is to scrape a website and save the extracted data as a CSV file.

## Hacker News

The site chosen for scraping is Hacker News, a tech-news site that lists user-submitted links to articles on a range of technology topics.

* The main page of the site shows each article's title, its link and the number of votes (points) given by readers.

* We are going to scrape the titles of the articles along with their links and votes.

* We are going to keep the articles with at least 100 votes and sort them by vote count, highest first.
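
The filter-and-sort step in the last bullet can be tried on its own before any scraping is involved. The following is a minimal sketch on hand-written records: the sample titles, the `example.com` links and the helper name `top_stories` are illustrative assumptions, not part of the notebook.

```python
# Hand-written sample records; titles, links and vote counts are made up.
stories = [
    {'title': 'Story A', 'link': 'https://example.com/a', 'votes': 120},
    {'title': 'Story B', 'link': 'https://example.com/b', 'votes': 45},
    {'title': 'Story C', 'link': 'https://example.com/c', 'votes': 310},
]


def top_stories(records, min_votes=100):
    """Keep records with at least min_votes and sort them by votes, highest first."""
    kept = [r for r in records if r['votes'] >= min_votes]
    return sorted(kept, key=lambda r: r['votes'], reverse=True)


ranked = top_stories(stories)
print([r['title'] for r in ranked])
```

Keeping the threshold as a parameter makes it easy to experiment with a different cut-off than 100.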

In [1]:
import requests
from bs4 import BeautifulSoup
import pprint
# res is the response for the first page of the site
res = requests.get('https://news.ycombinator.com/news')
# res2 is the response for the second page of the site
res2 = requests.get('https://news.ycombinator.com/news?p=2')
soup = BeautifulSoup(res.text, 'html.parser')
soup2 = BeautifulSoup(res2.text, 'html.parser')

# Title links on the first page
links = soup.select('.titlelink')
# Subtext row under each title on the first page (it holds the score)
subtext = soup.select('.subtext')
# Title links and subtexts on the second page
links2 = soup2.select('.titlelink')
subtext2 = soup2.select('.subtext')
# Combine the links and subtexts from both pages
mega_links = links + links2
mega_subtext = subtext + subtext2

# Sort the stories by votes, highest first
def sort_stories_by_votes(hnlist):
  return sorted(hnlist, key=lambda k: k['votes'], reverse=True)

# Build a list of dictionaries holding the title, link and votes of each story
def create_custom_hn(links, subtext):
  hn = []
  for idx, item in enumerate(links):
    title = item.getText()
    href = item.get('href', None)
    vote = subtext[idx].select('.score')
    if len(vote):
      # The score text reads "N points" (or "1 point"), so take the leading number
      points = int(vote[0].getText().split()[0])
      # Keep only stories with at least 100 points
      if points > 99:
        hn.append({'title': title, 'link': href, 'votes': points})
  return sort_stories_by_votes(hn)
 
pprint.pprint(create_custom_hn(mega_links, mega_subtext))

[{'link': 'https://bert.org/2022/06/02/payphone/',
  'title': 'Installing a payphone in my house',
  'votes': 1002},
 {'link': 'https://vscodium.com/',
  'title': 'VSCodium – Free/Libre Open Source Software Binaries of VS Code',
  'votes': 657},
 {'link': 'https://hirrolot.github.io/posts/rust-is-hard-or-the-misery-of-mainstream-programming.html',
  'title': 'Rust Is Hard, Or: The Misery of Mainstream Programming',
  'votes': 498},
 {'link': 'https://www.webosarchive.com',
  'title': "Show HN: I restored Palm's webOS App Catalog, SDK and online help "
           'system',
  'votes': 405},
 {'link': 'https://www.tbray.org/ongoing/When/202x/2022/06/02/Dangerous-Gift',
  'title': 'Dangerous Gift',
  'votes': 308},
 {'link': 'https://www.zkcrush.xyz/',
  'title': 'Confess your love with zero-knowledge',
  'votes': 279},
 {'link': 'https://ukdefencejournal.org.uk/classified-specs-leaked-on-war-thunder-forum-for-third-time/',
  'title': 'Classified specs leaked on War Thunder forum for third
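
The vote count comes from the text of the `.score` element, which reads like `657 points`. The sketch below shows one way to pull the number out of such text without depending on the exact suffix. It assumes the text starts with the number and that a singular form such as `1 point` can occur. The helper name `parse_points` is hypothetical.

```python
import re


def parse_points(score_text):
    """Return the leading integer from text such as '657 points', or None if absent."""
    match = re.match(r'\s*(\d+)', score_text)
    return int(match.group(1)) if match else None


print(parse_points('657 points'))
print(parse_points('1 point'))
print(parse_points(''))
```

Returning `None` for text without a number lets the caller skip that row instead of raising an exception.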

In [2]:
import pandas as pd
df = pd.DataFrame(create_custom_hn(mega_links, mega_subtext))
df

| | title | link | votes |
|---|---|---|---|
| 0 | Installing a payphone in my house | https://bert.org/2022/06/02/payphone/ | 1002 |
| 1 | VSCodium – Free/Libre Open Source Software Bin... | https://vscodium.com/ | 657 |
| 2 | Rust Is Hard, Or: The Misery of Mainstream Pro... | https://hirrolot.github.io/posts/rust-is-hard-... | 498 |
| 3 | Show HN: I restored Palm's webOS App Catalog, ... | https://www.webosarchive.com | 405 |
| 4 | Dangerous Gift | https://www.tbray.org/ongoing/When/202x/2022/0... | 308 |
| 5 | Confess your love with zero-knowledge | https://www.zkcrush.xyz/ | 279 |
| 6 | Classified specs leaked on War Thunder forum f... | https://ukdefencejournal.org.uk/classified-spe... | 259 |
| 7 | Shimano Forces Hammerhead to Remove All Di2 Re... | https://www.dcrainmaker.com/2022/05/shimano-fo... | 246 |
| 8 | Ceiling Air Purifier | https://www.jefftk.com/p/ceiling-air-purifier | 245 |
| 9 | Async Rust doesn't have to be hard | https://itsallaboutthebit.com/async-simple/ | 226 |


In [3]:
df['votes'].dtype

dtype('int64')

In [4]:
df['link'].dtype

dtype('O')

In [5]:
# Render the links in the DataFrame as clickable HTML
from IPython.display import HTML
HTML(df.to_html(render_links=True, escape=False))

| | title | link | votes |
|---|---|---|---|
| 0 | Installing a payphone in my house | https://bert.org/2022/06/02/payphone/ | 1002 |
| 1 | VSCodium – Free/Libre Open Source Software Binaries of VS Code | https://vscodium.com/ | 657 |
| 2 | Rust Is Hard, Or: The Misery of Mainstream Programming | https://hirrolot.github.io/posts/rust-is-hard-or-the-misery-of-mainstream-programming.html | 498 |
| 3 | Show HN: I restored Palm's webOS App Catalog, SDK and online help system | https://www.webosarchive.com | 405 |
| 4 | Dangerous Gift | https://www.tbray.org/ongoing/When/202x/2022/06/02/Dangerous-Gift | 308 |
| 5 | Confess your love with zero-knowledge | https://www.zkcrush.xyz/ | 279 |
| 6 | Classified specs leaked on War Thunder forum for third time | https://ukdefencejournal.org.uk/classified-specs-leaked-on-war-thunder-forum-for-third-time/ | 259 |
| 7 | Shimano Forces Hammerhead to Remove All Di2 Related Functionality From Karoo | https://www.dcrainmaker.com/2022/05/shimano-forces-hammerhead-to-remove-all-di2-related-functionality-from-karoo.html | 246 |
| 8 | Ceiling Air Purifier | https://www.jefftk.com/p/ceiling-air-purifier | 245 |
| 9 | Async Rust doesn't have to be hard | https://itsallaboutthebit.com/async-simple/ | 226 |
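
With `escape=False`, cell text is emitted as raw HTML, so a scraped title that happens to contain markup would be interpreted by the browser. One possible precaution, sketched here with the standard library only, is to escape the title and URL before building the anchor tag. The helper name `link_cell` is hypothetical and this is not what `to_html` does internally.

```python
from html import escape


def link_cell(url, title):
    """Return an <a> tag with both the URL and the visible title HTML-escaped."""
    return '<a href="{}">{}</a>'.format(escape(url), escape(title))


print(link_cell('https://vscodium.com/', 'VSCodium'))
```

Escaping leaves ordinary titles unchanged and only rewrites characters such as `<`, `>`, `&` and quotes.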


In [6]:
df['title'].dtype

dtype('O')

In [7]:
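# Write the DataFrame to a CSV file. The index is written as an unnamed
# first column by default; pass index=False to leave it out.
df.to_csv('file.csv')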
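
For comparison, the same list of dictionaries can be written to CSV without pandas. The sketch below uses the standard `csv` module with two rows copied from the output above. It writes to an in-memory buffer as a stand-in for a real file, which is an assumption made to keep the example self-contained.

```python
import csv
import io

# Two rows taken from the scraped output shown earlier.
rows = [
    {'title': 'Installing a payphone in my house',
     'link': 'https://bert.org/2022/06/02/payphone/', 'votes': 1002},
    {'title': 'Rust Is Hard, Or: The Misery of Mainstream Programming',
     'link': 'https://hirrolot.github.io/posts/rust-is-hard-or-the-misery-of-mainstream-programming.html',
     'votes': 498},
]

buffer = io.StringIO()  # stands in for open('file.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'link', 'votes'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Titles that contain commas are quoted automatically, and no index column is written.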