Import libraries

In [None]:
import requests
from bs4 import BeautifulSoup

In [1]:
url = "https://en.wikipedia.org/wiki/Web_crawler"

# Fetch the page; the timeout keeps the call from hanging indefinitely.
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Parse the HTML and pull out the page title.
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.text.strip()
    print(f"Title: {title}")
    # Collect every <a> tag, then print the hrefs that contain "http".
    links = soup.find_all("a")
    print(f"Number of links: {len(links)}")
    for link in links:
        href = link.get("href")
        if href and "http" in href:
            print(href)
else:
    print(f"Request failed with status code {response.status_code}")

Title: Web crawler - Wikipedia
Number of links: 712
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
https://af.wikipedia.org/wiki/Webkruiper
https://ar.wikipedia.org/wiki/%D8%B2%D8%A7%D8%AD%D9%81_%D8%A7%D9%84%D8%B4%D8%A8%D9%83%D8%A9
https://bar.wikipedia.org/wiki/Webcrawler
https://ca.wikipedia.org/wiki/Aranya_web
https://cs.wikipedia.org/wiki/Web_crawler
https://cy.wikipedia.org/wiki/Ymgripiwr_gwe
https://ary.wikipedia.org/wiki/%D8%B1%D8%AA%D9%8A%D9%84%D8%A9_(%D8%A8%D9%88%D8%AA)
https://de.wikipedia.org/wiki/Webcrawler
https://el.wikipedia.org/wiki/%CE%91%CE%BD%CE%B9%CF%87%CE%BD%CE%B5%CF%85%CF%84%CE%AE%CF%82_%CE%B9%CF%83%CF%84%CE%BF%CF%8D
https://es.wikipedia.org/wiki/Ara%C3%B1a_web
https://eu.wikipedia.org/wiki/Web_crawler
https://fa.wikipedia.org/wiki/%D8%AE%D8%B2%D9%86%D8%AF%D9%87_%D9%88%D8%A8
https://fr.wikipedia.org/wiki/Robot_d%27indexation
https://ko.wikipedia.org/wiki/%EC%9B%B9_%ED

This code is a simple web scraper that fetches the Wikipedia page on web crawlers and prints the page's title, the total number of links on the page, and every link whose href contains "http" (in practice, the absolute URLs).

The code first sends a GET request to the specified URL using the requests library and checks whether the response's status code is 200, which indicates that the request succeeded. If it did, the code parses the page's HTML with BeautifulSoup and extracts the page's title.

Next, the code calls the find_all() method to collect every a tag on the page (the tags that represent links) and prints how many it found. It then loops through the tags and prints each href attribute that exists and contains the string "http". This is only a rough test for external links: it also matches links to other Wikipedia language editions and any relative link that happens to have "http" somewhere in it.
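The "http" substring check in the loop is only an approximation of "external". A stricter test, sketched below with the standard library's urllib.parse, resolves each href against the page URL and compares host names. The helper name is_external and the sample hrefs are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urljoin, urlparse


def is_external(href, base_url):
    """Return True if href points to a different host than base_url."""
    # Resolve relative and protocol-relative hrefs against the page URL.
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar schemes
    return parsed.netloc.lower() != urlparse(base_url).netloc.lower()


base = "https://en.wikipedia.org/wiki/Web_crawler"

# Hypothetical hrefs of the kinds found on a typical page.
samples = [
    "/wiki/Search_engine",
    "https://en.wikipedia.org/wiki/Spider",
    "https://de.wikipedia.org/wiki/Webcrawler",
    "//www.example.org/page",
    "#History",
    "mailto:someone@example.org",
]

for href in samples:
    print(href, is_external(href, base))
```

Under this definition a link to another language edition such as de.wikipedia.org counts as external because its host differs. Comparing registered domains instead would need extra logic that this sketch does not attempt.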

If the request was not successful, the code prints an error message indicating the status code returned by the server.
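The error message could also include the standard reason phrase for the status code. The following is a minimal sketch using the standard library's http.HTTPStatus; the helper name describe_status is an assumption introduced here, not something the notebook defines.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not in the standard library's table of statuses.
        return f"{code} (unrecognized status code)"


print(f"Request failed with status code {describe_status(404)}")
# prints: Request failed with status code 404 Not Found
```

In the notebook, the argument would be response.status_code instead of a literal value.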

Now, let's store the links in a CSV file.


In [4]:
import csv

if response.status_code == 200:
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")

    # newline='' lets the csv module control line endings itself.
    with open('output.csv', mode='w', encoding='utf-8', newline='') as csv_file:
        fieldnames = ['link']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()

        # Write one row per href that contains "http".
        for link in links:
            href = link.get("href")
            if href and "http" in href:
                writer.writerow({'link': href})
else:
    print(f"Request failed with status code {response.status_code}")

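This code creates a CSV file named "output.csv" in the current working directory (for a notebook, usually the notebook's own folder), overwriting any existing file of that name. It writes a "link" header row followed by one row for every link on the page whose href contains "http".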
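A page often links to the same URL more than once, so it can be useful to drop duplicates before writing and to read the file back as a check. The sketch below does both with the csv module. The helper names write_links and read_links, the sample hrefs, and the temporary directory are assumptions chosen so the example runs without a network request.

```python
import csv
import os
import tempfile


def write_links(path, hrefs):
    """Write unique hrefs under a 'link' header, keeping first-seen order."""
    unique = list(dict.fromkeys(hrefs))
    with open(path, mode="w", encoding="utf-8", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["link"])
        writer.writeheader()
        writer.writerows({"link": href} for href in unique)
    return len(unique)


def read_links(path):
    """Read the 'link' column back into a list of strings."""
    with open(path, encoding="utf-8", newline="") as csv_file:
        return [row["link"] for row in csv.DictReader(csv_file)]


# Hypothetical hrefs standing in for the scraped links.
sample = [
    "https://de.wikipedia.org/wiki/Webcrawler",
    "https://fr.wikipedia.org/wiki/Robot_d%27indexation",
    "https://de.wikipedia.org/wiki/Webcrawler",
]

path = os.path.join(tempfile.mkdtemp(), "output.csv")
count = write_links(path, sample)
print(count, read_links(path))
```

With the notebook's data, the hrefs argument would be the filtered href values taken from links.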