# Web Scraping
Web scraping is the process of extracting data from websites by parsing their HTML content.

## Beautiful Soup

To get started, install beautifulsoup4 and an HTML parser like lxml. Alternatively, you can use other parsers like html5lib.
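
The parser is chosen through the second argument of the `BeautifulSoup` constructor. The sketch below is only an illustration: the HTML snippet is made up for this example, and it uses `html.parser`, which ships with Python, so it runs even when `lxml` or `html5lib` is not installed.

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet made up for illustration (hypothetical data)
html = """
<div class="card">
  <h5 class="card-title">Python for Beginners</h5>
  <a class="btn btn-primary" href="#">Start for $19</a>
</div>
"""

# 'html.parser' is built into Python; 'lxml' and 'html5lib' must be installed separately
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h5").get_text(strip=True)
price = soup.find("a").get_text(strip=True).split()[-1]  # last word of the link text
print(title, price)
```

Swapping `"html.parser"` for `"lxml"` should give the same result on well-formed markup like this; the parsers mainly differ in speed and in how they repair broken HTML.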

- Working with a local file

In [1]:
from bs4 import BeautifulSoup

# We are working with a local HTML file, so no request is needed; we just read the file
with open('home.html', 'r') as file:
    content = file.read()

# Turn the content into a BeautifulSoup object
soup = BeautifulSoup(content, 'lxml')

# Print the document in a readable, indented format
print(soup.prettify())

# Find the first element with a specific tag
tag = soup.find('h5')
# tag is a Beautiful Soup object, so its methods are available; here we print its HTML
print(tag.prettify())

# Find all elements with that tag and store them under a descriptive name
courses_names_tags = soup.find_all('h5')

# courses_names_tags is a list of Beautiful Soup tags, so we iterate and print each one's text
for item in courses_names_tags:
    print(item.text)

# Filter by tag name first, then by class
course_cards = soup.find_all('div', class_='card')
for course in course_cards:
    course_name = course.h5.text  # course.<tag> returns the first matching tag inside the card
    course_price = course.a.text.split()[-1]  # last word of the link text, e.g. "$19"
    print(course_name, course_price)

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" rel="stylesheet"/>
  <title>
   Courses Overview
  </title>
 </head>
 <body>
  <h1>
   Welcome to Your Learning Path!
  </h1>
  <div class="card" id="card-python-intro">
   <div class="card-header">
    Python
   </div>
   <div class="card-body">
    <h5 class="card-title">
     Python for Beginners
    </h5>
    <p class="card-text">
     If you're just starting out with programming, this course will guide you through Python basics!
    </p>
    <a class="btn btn-primary" href="#">
     Start for $19
    </a>
   </div>
  </div>
  <div class="card" id="card-python-web-dev">
   <div class="card-header">
    Python
   </div>
   <div class="c

- Working with a real website

Web scraping involves making requests to websites to get their HTML, then parsing it to extract the desired data.

Tools like Beautiful Soup make parsing the HTML much easier than manipulating the raw text with Python string methods.

In [2]:
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'

# Send a GET request to the URL
response = requests.get(url)

# Print the status code; 200 means the request succeeded
print(response.status_code)

# Parse the response content using BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')

# Extract the titles and subtitles from the website
news_titles = soup.find_all('span', class_='titleline')
news_subtitles = soup.find_all('span', class_='subline')

# Pair each title with its metadata row
for title, subtitle in zip(news_titles, news_subtitles):
    item_site = title.text.split()[-1]  # The last word of the title line is the site
    item_title = ' '.join(title.text.split()[:-1])  # Everything before the site is the title
    news_points = subtitle.find('span', class_='score').text
    news_author = subtitle.find('a', class_='hnuser').text
    news_time_posted = subtitle.find('span', class_='age').text
    news_comments = subtitle.find_all('a')[-1].text
    # Print combined info
    print(f"{item_title}, Site: {item_site}")
    print(f"Points: {news_points}, Author: {news_author}, Time: {news_time_posted}, Comments: {news_comments}\n")

200
Just: Just a Command Runner, Site: (just.systems)
Points: 250 points, Author: thunderbong, Time: 5 hours ago, Comments: 159 comments

Raspberry Pi 5 now supports Valve's Steam Link, Site: (raspberrypi.com)
Points: 126 points, Author: Venn1, Time: 2 hours ago, Comments: 31 comments

MIT largest open-source car design dataset, incl aerodynamics, to speed design, Site: (news.mit.edu)
Points: 114 points, Author: toss1, Time: 7 hours ago, Comments: 32 comments

I algorithmically donated $5000 to Open Source, Site: (kvinogradov.com)
Points: 260 points, Author: lorey, Time: 10 hours ago, Comments: 51 comments

The Startup Trap (2013), Site: (cleancoder.com)
Points: 31 points, Author: sandwichsphinx, Time: 3 hours ago, Comments: 24 comments

Beekeepers halt honey awards over fraud in global supply chain, Site: (theguardian.com)
Points: 115 points, Author: a_w, Time: 6 hours ago, Comments: 108 comments

An EPYC Exclusive for Azure: AMD's MI300C – By George Cozma, Site: (chipsandcheese.com)
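
The loop above calls `.text` directly on the result of `find`, and `find` returns `None` when nothing matches, so a row without a score or an author raises an `AttributeError`. The sketch below shows one possible guard. The helper name `safe_text` and the simplified markup are assumptions made for illustration, not the page's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the second row has no score and no author
html = """
<span class="subline">
  <span class="score">250 points</span>
  <a class="hnuser">thunderbong</a>
  <span class="age">5 hours ago</span>
</span>
<span class="subline">
  <span class="age">1 hour ago</span>
</span>
"""

def safe_text(parent, name, css_class, default="n/a"):
    """Return the stripped text of the first matching child, or `default` if there is none."""
    node = parent.find(name, class_=css_class)
    return node.get_text(strip=True) if node is not None else default

soup = BeautifulSoup(html, "html.parser")
rows = []
for subline in soup.find_all("span", class_="subline"):
    rows.append((
        safe_text(subline, "span", "score"),
        safe_text(subline, "a", "hnuser"),
        safe_text(subline, "span", "age"),
    ))
print(rows)
```

Returning a default value keeps the loop running; skipping incomplete rows entirely would be an equally reasonable choice.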


- Adding a Filter for Minimum Points

In [3]:
# Ensure valid input for min_points
while True:
    try:
        min_points = int(input('Enter the minimum number of points: '))
        break  # Exit the loop if a valid integer is provided
    except ValueError:
        print("Please enter a valid number.")

def find_news(min_points):
    """Print the articles that have at least `min_points` points."""
    for title, subtitle in zip(news_titles, news_subtitles):
        # Read the score (e.g. "250 points") and convert the number to an integer
        news_points = int(subtitle.find('span', class_='score').text.strip().split()[0])

        # Keep only the articles that reach the threshold
        if news_points >= min_points:
            item_site = title.text.split()[-1]  # Site: last word of the title line
            item_title = ' '.join(title.text.split()[:-1])  # Everything except the site
            print(f"{item_title}, Site: {item_site}, Points: {news_points}")

find_news(min_points)
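
The points filter can also be separated from the scraping so that it can be tried without a network request. This is a minimal sketch: `parse_points` and `filter_by_points` are hypothetical helper names, and the sample rows are copied from the output shown earlier.

```python
def parse_points(score_text):
    """Convert a label such as '250 points' to an integer (0 if it has no leading number)."""
    parts = score_text.strip().split()
    return int(parts[0]) if parts and parts[0].isdigit() else 0

def filter_by_points(rows, min_points):
    """Keep the (title, score_text) pairs whose score is at least `min_points`."""
    return [(title, parse_points(score)) for title, score in rows
            if parse_points(score) >= min_points]

# Sample data taken from the output above
sample_rows = [
    ("Just: Just a Command Runner", "250 points"),
    ("The Startup Trap (2013)", "31 points"),
    ("Raspberry Pi 5 now supports Valve's Steam Link", "126 points"),
]

popular = filter_by_points(sample_rows, 100)
print(popular)
```

Because the function returns a list instead of printing, the same result can be reused for display or for writing to a file.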

- Saving data into a CSV file

In [None]:
# Replace commas and special characters with a space so they do not break the CSV columns
def clean_text(text):
    return text.replace(',', ' ').replace('�', ' ').replace('–', ' ').replace('•', ' ')

with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as f:
    f.write('Title,website,points,author,time posted,number of comments\n')

    for title, subtitle in zip(news_titles, news_subtitles):
        news_points = int(subtitle.find('span', class_='score').text.strip().split()[0])

        if news_points >= min_points:
            item_site = clean_text(title.text.split()[-1])  # Site (last word of the title line)
            item_title = clean_text(' '.join(title.text.split()[:-1]))  # Title without the site
            news_author = clean_text(subtitle.find('a', class_='hnuser').text)
            news_time_posted = clean_text(subtitle.find('span', class_='age').text)
            # The last link reads e.g. "159 comments", or a non-numeric label such as "discuss"
            comments_text = clean_text(subtitle.find_all('a')[-1].text.strip())
            first_word = comments_text.split()[0]
            news_comments = int(first_word) if first_word.isdigit() else 0  # Keep only the number

            # Write one article per row
            f.write(f"{item_title},{item_site},{news_points},{news_author},{news_time_posted},{news_comments}\n")
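
Writing rows by hand with `f.write` means commas have to be stripped from every field first. As an alternative, Python's standard `csv` module quotes such fields automatically. The sketch below writes to an in-memory buffer so that it runs anywhere; the sample rows are copied from the output shown earlier, and the buffer stands in for the real file as an assumption of this example.

```python
import csv
import io

header = ["Title", "website", "points", "author", "time posted", "number of comments"]
rows = [
    ["MIT largest open-source car design dataset, incl aerodynamics, to speed design",
     "(news.mit.edu)", 114, "toss1", "7 hours ago", 32],
    ["The Startup Trap (2013)", "(cleancoder.com)", 31, "sandwichsphinx", "3 hours ago", 24],
]

# The buffer stands in for open('scraped_data.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the first title with its commas intact
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1][0])
```

With this approach the titles keep their original punctuation, and `csv.reader` (or a spreadsheet) splits the columns correctly.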
