# Web Scraping

What is web scraping all about?

Imagine that one day, out of the blue, you find yourself thinking “Gee, I wonder who the five most popular mathematicians are?”

You do a bit of thinking, and you get the idea to use Wikipedia’s XTools to measure the popularity of a mathematician by equating popularity with pageviews. For example, look at the page on Henri Poincaré. There, you can see that Poincaré’s pageviews for the last 60 days are, as of December 2017, around 32,000.

Next, you Google “famous mathematicians” and find this resource that lists 100 names. Now you have a page listing mathematicians’ names as well as a website that provides information about how “popular” that mathematician is. Now what?

This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.

In this tutorial, you will be writing a Python program that downloads the list of 100 mathematicians and their XTools pages, selects data about their popularity, and finishes by telling us the top 5 most popular mathematicians of all time! Let’s get started.
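The three steps described above (download, select, pass along) can be sketched on a tiny page. To keep the sketch runnable offline, the download step is replaced by a hard-coded HTML string; the markup, the `mathematicians` id, and the list entries are invented for illustration, and Python's built-in `html.parser` is used so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Stand-in for a downloaded page (hypothetical markup, not the real list page).
html = """
<html><body>
  <ul id="mathematicians">
    <li>Henri Poincaré</li>
    <li>Emmy Noether</li>
    <li>Leonhard Euler</li>
  </ul>
</body></html>
"""

# Step 1 (download) would normally be requests.get(url).text.
soup = BeautifulSoup(html, "html.parser")

# Step 2: select only the data we care about.
name_list = soup.find("ul", id="mathematicians")
names = [li.get_text(strip=True) for li in name_list.find_all("li")]

# Step 3: pass the selection along to another process (here, just sorting).
sorted_names = sorted(names)
for name in sorted_names:
    print(name)
```

In a real run, only step 1 changes: the string comes from an HTTP response instead of a literal.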


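You will need to install only these two packages:

- requests, for performing your HTTP requests
- BeautifulSoup4, for handling all of your HTML processing

The cells below also pass 'lxml' as the parser, which requires the separate lxml package; Python's built-in 'html.parser' works without it. The YouTube examples further down additionally use pandas and selenium.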


In [1]:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.connection-sphere.com/en/blogs/883471274311/?lang=en').text

In [2]:
soup = BeautifulSoup(source, 'lxml')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="Connect with professionals, post jobs, list real estate, and grow your business globally. Join Connection Sphere today." name="description"/>
<!-- Add this in the head section -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
<!-- Cronitor RUM -->
<script async="" src="https://rum.cronitor.io/script.js"></script>
<script>
    window.cronitor = window.cronitor || function() { (window.cronitor.q = window.cronitor.q || []).push(arguments); };
    cronitor('config', { clientKey: 'ebd61332260a07c66d6285ffce1f457b' });
</script>
<script>
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [['\\(', '\\)']],
        displayMath: [['\\[', '\\]']],
        processEscapes: true,
        processEnvironments: true
    },
    "HTML-CSS": { linebreaks: { aut

In [3]:
# prettify() re-indents the parsed HTML so the nesting is easier to read
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Connect with professionals, post jobs, list real estate, and grow your business globally. Join Connection Sphere today." name="description"/>
  <!-- Add this in the head section -->
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
  </script>
  <!-- Cronitor RUM -->
  <script async="" src="https://rum.cronitor.io/script.js">
  </script>
  <script>
   window.cronitor = window.cronitor || function() { (window.cronitor.q = window.cronitor.q || []).push(arguments); };
    cronitor('config', { clientKey: 'ebd61332260a07c66d6285ffce1f457b' });
  </script>
  <script>
   MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [['\\(', '\\)']],
        displayMath: [['\\[', '\\]']],
        processEscapes: true,
        processEnvironments: true
    },
    

In [4]:
csphere = requests.get('https://www.connection-sphere.com/en').text
csphere = BeautifulSoup(csphere, 'lxml')
csphere

<!DOCTYPE html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="Connect with professionals, post jobs, list real estate, and grow your business globally. Join Connection Sphere today." name="description"/>
<!-- Add this in the head section -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
<!-- Cronitor RUM -->
<script async="" src="https://rum.cronitor.io/script.js"></script>
<script>
    window.cronitor = window.cronitor || function() { (window.cronitor.q = window.cronitor.q || []).push(arguments); };
    cronitor('config', { clientKey: 'ebd61332260a07c66d6285ffce1f457b' });
</script>
<script>
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [['\\(', '\\)']],
        displayMath: [['\\[', '\\]']],
        processEscapes: true,
        processEnvironments: true
    },
    "HTML-CSS": { linebreaks: { aut

In [5]:
# Inspect the page first to see its structure before extracting anything
sections = csphere.find_all('section')
sections

[<section class="py-5 rounded">
 <div class="test-hero-section signup py-5 rounded">
 <video autoplay="" id="background-video" loop="" muted="" playsinline="">
 <source src="/static/videos/homepage.mp4" style="width: 100%; height: 100%;" type="video/mp4">
 </source></video>
 <div class="text-white text-center py-5 test-hero-content">
 <div class="container">
 <!-- Header section -->
 <div class="mb-4">
 <h4 class="display-5 fw-bold mb-3">Welcome To CSphere!</h4>
 </div>
 <!-- Content section with some vertical spacing -->
 <div class="mt-4">
 <h6 class="fw-bold mb-3 h5">Unlock your potential:</h6>
 <p class="lead mx-auto" style="max-width: 700px;">
                 Connect with experts, empower your business, and build a stronger future through meaningful connections.
             </p>
 </div>
 </div>
 </div>
 </div>
 </section>,
 <section class="rounded card1 py-3" id="features-overview">
 <div class="container-fluid text-center aboutus">
 <h2>Features</h2>
 </div>
 <div class="contai
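One way to inspect a page's structure before extracting anything is to count how often each tag occurs. The sketch below does this on a small hard-coded snippet (the markup is invented); `find_all(True)` matches every tag in the tree.

```python
from collections import Counter
from bs4 import BeautifulSoup

html = """
<section><h2>Features</h2><p>Blogs</p><p>Deals</p></section>
<section><a href="/en/login/">Sign Up</a></section>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag, so this tallies tag names across the page.
tag_counts = Counter(tag.name for tag in soup.find_all(True))

for name, count in tag_counts.most_common():
    print(f"{name}: {count}")
```

Tags that occur many times (such as `p` or `a`) are usually the ones worth looping over with `find_all`.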

In [6]:
texts = csphere.find_all('p')
texts

[<p class="lead mx-auto" style="max-width: 700px;">
                 Connect with experts, empower your business, and build a stronger future through meaningful connections.
             </p>,
 <p>Connect with experts and businesses today!</p>,
 <p class="content_p2">Stay informed and ahead of the curve with our industry-leading blogs. Authored by experts, our blogs provide valuable insights, latest trends, and actionable strategies. Whether you're a business looking for growth tips or an expert sharing knowledge, our platform is your gateway to thought leadership.</p>,
 <p class="content_p">Unlock exclusive deals and opportunities only found on our platform. Connect with businesses and experts globally to access special offers, partnerships, and collaborations. This is your chance to leverage competitive advantages and foster business growth through strategic deals.</p>,
 <p class="content_p2">Discover and connect with industry leaders and experts. Our expert profiles showcase skills,

In [7]:
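# Caution: the loop body ignores i, so the entire list is printed once per
# <p> element. Printing i instead would show each element exactly once.
for i in csphere.find_all('p'):
    print(csphere.find_all('p'))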

[<p class="lead mx-auto" style="max-width: 700px;">
                Connect with experts, empower your business, and build a stronger future through meaningful connections.
            </p>, <p>Connect with experts and businesses today!</p>, <p class="content_p2">Stay informed and ahead of the curve with our industry-leading blogs. Authored by experts, our blogs provide valuable insights, latest trends, and actionable strategies. Whether you're a business looking for growth tips or an expert sharing knowledge, our platform is your gateway to thought leadership.</p>, <p class="content_p">Unlock exclusive deals and opportunities only found on our platform. Connect with businesses and experts globally to access special offers, partnerships, and collaborations. This is your chance to leverage competitive advantages and foster business growth through strategic deals.</p>, <p class="content_p2">Discover and connect with industry leaders and experts. Our expert profiles showcase skills, exper

In [13]:
for i in csphere.find_all('p'):
    print(i.text)


                Connect with experts, empower your business, and build a stronger future through meaningful connections.
            
Connect with experts and businesses today!
Stay informed and ahead of the curve with our industry-leading blogs. Authored by experts, our blogs provide valuable insights, latest trends, and actionable strategies. Whether you're a business looking for growth tips or an expert sharing knowledge, our platform is your gateway to thought leadership.
Unlock exclusive deals and opportunities only found on our platform. Connect with businesses and experts globally to access special offers, partnerships, and collaborations. This is your chance to leverage competitive advantages and foster business growth through strategic deals.
Discover and connect with industry leaders and experts. Our expert profiles showcase skills, experiences, and achievements, making it easy for businesses to find the right expertise. Experts can expand their reach, showcase their portfol

In [16]:
paragraphs = [p.get_text() for p in csphere.find_all('p')]
for text in paragraphs:
    print(text)


                Connect with experts, empower your business, and build a stronger future through meaningful connections.
            
Connect with experts and businesses today!
Stay informed and ahead of the curve with our industry-leading blogs. Authored by experts, our blogs provide valuable insights, latest trends, and actionable strategies. Whether you're a business looking for growth tips or an expert sharing knowledge, our platform is your gateway to thought leadership.
Unlock exclusive deals and opportunities only found on our platform. Connect with businesses and experts globally to access special offers, partnerships, and collaborations. This is your chance to leverage competitive advantages and foster business growth through strategic deals.
Discover and connect with industry leaders and experts. Our expert profiles showcase skills, experiences, and achievements, making it easy for businesses to find the right expertise. Experts can expand their reach, showcase their portfol
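The first paragraph in the output above keeps the indentation and line breaks from the HTML source. A minimal sketch of one way to tidy this, assuming single-spaced text is what you want: split on any whitespace and rejoin. The HTML string is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

html = """
<p class="lead mx-auto">
                Connect with experts, empower your business.
            </p>
<p>Connect with experts and businesses today!</p>
"""
soup = BeautifulSoup(html, "html.parser")

# str.split() with no arguments splits on any run of whitespace,
# so joining with single spaces removes indentation and newlines.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

for text in paragraphs:
    print(text)
```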

In [18]:
for i in csphere.find_all('section'):
    print(i.text)










Welcome To CSphere!



Unlock your potential:

                Connect with experts, empower your business, and build a stronger future through meaningful connections.
            







Features



Blogs
Deals
Training Courses
Expert and Company Listing
Job Posting
Private Messaging




Join Our Network
Connect with experts and businesses today!
Sign Up





Join Our Network
Connect with experts and businesses today!
Sign Up









Blogs
Stay informed and ahead of the curve with our industry-leading blogs. Authored by experts, our blogs provide valuable insights, latest trends, and actionable strategies. Whether you're a business looking for growth tips or an expert sharing knowledge, our platform is your gateway to thought leadership.

Search for New Deals   












Exclusive Deals
Unlock exclusive deals and opportunities only found on our platform. Connect with businesses and experts globally to access special offers, partnerships, and collaborations. This is your c

In [20]:
for i in csphere.find_all('div'):
    print(i.text)





 NETWORKING B2B 








 CSphere



CSphere



AI Solutions

CONNECTIONS


Opportunities
Resources
Solutions
Trainings


JOIN US
LOGIN

ORGANIZATION


Who We Are
FAQs

More



Send Us Your project
Contact Us





LANGUAGES




                      English
                  



                      français
                  

















 CSphere



CSphere



AI Solutions

CONNECTIONS


Opportunities
Resources
Solutions
Trainings


JOIN US
LOGIN

ORGANIZATION


Who We Are
FAQs

More



Send Us Your project
Contact Us





LANGUAGES




                      English
                  



                      français
                  












 CSphere



CSphere



AI Solutions

CONNECTIONS


Opportunities
Resources
Solutions
Trainings


JOIN US
LOGIN

ORGANIZATION


Who We Are
FAQs

More



Send Us Your project
Contact Us





LANGUAGES




                      English
                  



                      français
                  







CSphere
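The navigation text repeats in the output above because `find_all('div')` returns nested `<div>` elements as well, and each outer element's `.text` includes the text of everything inside it. The sketch below reproduces the effect on invented markup and shows `stripped_strings`, which yields each piece of text once.

```python
from bs4 import BeautifulSoup

html = "<div><div>CSphere</div><div>AI Solutions</div></div>"
soup = BeautifulSoup(html, "html.parser")

# The outer div and both inner divs are all matched, in document order.
div_texts = [div.text for div in soup.find_all("div")]
print(div_texts)

# stripped_strings walks the text nodes themselves, so nothing is repeated.
unique_texts = list(soup.stripped_strings)
print(unique_texts)
```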














In [21]:
cnn = requests.get('https://cnn.com').text
cnn = BeautifulSoup(cnn, 'lxml')
cnn

<!DOCTYPE html>
<html data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" lang="en">
<head>
<link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
<link href="//tpc.googlesyndication.com" rel="preconnect"/>
<link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
<link href="//pagead2.googlesyndication.com" rel="preconnect"/>
<link href="//www.googletagservices.com" rel="dns-prefetch"/>
<link href="//www.googletagservices.com" rel="preconnect"/>
<link href="//www.google.com" rel="dns-prefetch"/>
<link href="//www.google.com" rel="preconnect"/>
<link href="//c.amazon-adsystem.com" rel="dns-prefetch"/>
<link href="//c.amazon-adsystem.com" rel="preconnect"/>
<link href="//ib.adnxs.com" rel="dns-prefetch"/>
<link href="//ib.adnxs.com" rel="preconnect"/>
<link href="//cdn.adsafeprotected.com" rel="dns-prefetch"/>
<link href="//cdn.adsafeprotected.com" rel="preconnect"/>

In [22]:
csphere.find_all('a')

[<a href="/en/"> <i class="bx bx-menu"></i><i class="bx bx-globe"></i>CSphere</a>,
 <a class="" href="/en/ai-solutions/">AI Solutions</a>,
 <a href="#">CONNECTIONS</a>,
 <a class="" href="/en/opportunities/">Opportunities</a>,
 <a class="" href="/en/resources/">Resources</a>,
 <a class="" href="/en/solutions/">Solutions</a>,
 <a class="" href="/en/trainings/">Trainings</a>,
 <a class="" href="/en/register/">JOIN US</a>,
 <a class="" href="/en/login/">LOGIN</a>,
 <a href="#">ORGANIZATION</a>,
 <a class="" href="/en/aboutus/">Who We Are</a>,
 <a class="" href="/en/faqs/">FAQs</a>,
 <a href="#">More</a>,
 <a class="" href="#">Send Us Your project</a>,
 <a class="" href="/en/contacts/create/">Contact Us</a>,
 <a href="#">LANGUAGES</a>,
 <a class="" href="/en/">
                       English
                   </a>,
 <a class="" href="/fr/">
                       français
                   </a>,
 <a href="/en/">Home</a>,
 <a href="/en/contacts/create/">Contact</a>,
 <a href="/en/aboutus/

In [23]:
# Print the href of every <a> tag on the page
all_links = csphere.find_all("a")
for link in all_links:
    print(link.get("href"))

/en/
/en/ai-solutions/
#
/en/opportunities/
/en/resources/
/en/solutions/
/en/trainings/
/en/register/
/en/login/
#
/en/aboutus/
/en/faqs/
#
#
/en/contacts/create/
#
/en/
/fr/
/en/
/en/contacts/create/
/en/aboutus/
mailto:csphere@connection-sphere.com
mailto:csphere@connection-sphere.com
mailto:csphere.inc@gmail.com
/en/login/
/en/login/
/en/login/
/en/login/
/en/login/
/en/login/
/en/login/
/en/login/
/en/privacy_policy/
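Most of the hrefs above are relative paths, some are in-page anchors (`#`) or mail links, and many repeat. A sketch of one way to turn such a list into unique absolute URLs with the standard library; `absolute_links` is a hypothetical helper, and the base URL is the page requested earlier.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.connection-sphere.com/en"

def absolute_links(hrefs, base_url=BASE_URL):
    """Return unique absolute URLs in first-seen order, skipping anchors and mail links."""
    seen = {}
    for href in hrefs:
        # Skip missing hrefs, in-page anchors, and mail links.
        if not href or href.startswith("#") or href.startswith("mailto:"):
            continue
        # dict keys keep insertion order, which gives ordered de-duplication.
        seen.setdefault(urljoin(base_url, href), None)
    return list(seen)

sample = ["/en/", "/en/ai-solutions/", "#", "/en/", None,
          "mailto:csphere@connection-sphere.com", "/fr/"]
links = absolute_links(sample)
print(links)
```

The `None` check matters because `link.get("href")` returns `None` for `<a>` tags that have no `href` attribute.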


In [24]:
# Get the title
title = cnn.title
print(title)

<title>Breaking News, Latest News and Videos | CNN</title>
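Printing `cnn.title` shows the whole tag, including the `<title>` markup. If only the text is wanted, `.string` (or `get_text()`) returns it. A small sketch on a hard-coded document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Breaking News, Latest News and Videos | CNN</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the Tag object, prints with markup
title_text = soup.title.string  # just the text inside the tag

print(title_tag)
print(title_text)
```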


In [25]:
cnn.find_all('a')

[<a class="brand-logo__logo-link" data-zjs="click" data-zjs-component_id="https://www.cnn.com" data-zjs-component_text="Main Logo" data-zjs-component_type="icon" data-zjs-container_id="" data-zjs-container_type="navigation" data-zjs-destination_url="https://www.cnn.com" data-zjs-page_type="section" data-zjs-page_variant="landing_homepage" href="https://www.cnn.com" title="CNN logo">
 <span class="brand-logo__logo">
 <svg class="brand-logo__icon" fill="none" height="22" viewbox="0 0 46 22" width="46" xmlns="http://www.w3.org/2000/svg"><path clip-rule="evenodd" d="M6.10447 11.0001C6.10447 8.531 8.10642 6.52954 10.5752 6.52954H13.9675V3.99665H10.5476C6.68578 3.99665 3.5437 7.13824 3.5437 11.0003C3.5437 14.8619 6.68578 18.0039 10.5476 18.0039L17.1326 18.0037C17.5009 18.0037 17.797 17.6415 17.797 17.3414V4.36936C17.797 3.65427 18.2455 3.05144 18.9127 2.86949C19.482 2.71444 20.2803 2.87169 20.8136 3.77851C20.8386 3.82033 22.4563 6.60902 24.922 10.8592C26.8569 14.1962 28.8589 17.6469 28.8951 

In [26]:
# Print the href of every <a> tag on the page
all_links = cnn.find_all("a")
for link in all_links:
    print(link.get("href"))

https://www.cnn.com
https://www.cnn.com/us
https://www.cnn.com/world
https://www.cnn.com/politics
https://www.cnn.com/business
https://www.cnn.com/health
https://www.cnn.com/entertainment
https://www.cnn.com/cnn-underscored
https://www.cnn.com/style
https://www.cnn.com/travel
https://www.cnn.com/sports
https://www.cnn.com/science
https://www.cnn.com/climate
https://www.cnn.com/weather
https://www.cnn.com/world/europe/ukraine
https://www.cnn.com/world/middleeast/israel
https://www.cnn.com/games
https://www.cnn.com/cnn-underscored/deals/black-friday
None
https://www.cnn.com/us
https://www.cnn.com/world
https://www.cnn.com/politics
https://www.cnn.com/business
https://www.cnn.com/health
https://www.cnn.com/entertainment
https://www.cnn.com/cnn-underscored
https://www.cnn.com/style
https://www.cnn.com/travel
https://www.cnn.com/sports
https://www.cnn.com/science
https://www.cnn.com/climate
https://www.cnn.com/weather
https://www.cnn.com/world/europe/ukraine
https://www.cnn.com/world/middle

In [27]:
# find_all('body') returns a one-element list, so this prints the whole page, not five rows
rows = cnn.find_all('body')
print(rows[:5])

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
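The message above is the notebook refusing to render that much output at once. A sketch of one way to peek at a large element without flooding the notebook, using a hypothetical helper named `preview` and an invented page:

```python
from bs4 import BeautifulSoup

def preview(tag, limit=80):
    """Return at most `limit` characters of a tag's markup (hypothetical helper)."""
    markup = str(tag)
    return markup if len(markup) <= limit else markup[:limit] + "..."

html = "<html><body>" + "<p>story</p>" * 1000 + "</body></html>"
soup = BeautifulSoup(html, "html.parser")

bodies = soup.find_all("body")
print(len(bodies))          # number of <body> elements found
print(preview(bodies[0]))   # short preview instead of the full markup
```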



In [29]:
matchdivtext = cnn.div.text
print(matchdivtext)






CNN values your feedback




                                                        1. How relevant is this ad to you?
                                                






























                                                2. Did you encounter any technical issues?
                                        











                                                                        Video player was slow to load content
                                                                        



                                                                        Video content never loaded
                                                                        



                                                                        Ad froze or did not finish loading
                                                                        



                                                                        Video content did not start after ad


In [31]:
matchdiv = cnn.div
print(matchdiv)

<div class="header__wrapper-inner" data-editable="header">
<div class="ad-feedback__modal modal__overlay" data-uri="cms.cnn.com/_components/ad-feedback/instances/cnn-v1@published" id="ad-feedback__modal-overlay" style="display:none">
<div class="ad-feedback__container">
<form class="ad-feedback__form">
<div class="ad-feedback__heading">
<h3 class="ad-feedback__heading__text">CNN values your feedback</h3>
<div class="ad-feedback__heading__close" id="ad-feedback__close-icon"></div>
</div>
<div class="ad-feedback__content-container" data-sentiment="ad">
<div class="ad-feedback__question-container">
                                                        1. How relevant is this ad to you?
                                                </div>
<div class="ad-feedback__answers-container">
<div class="ad-feedback__emoji-container">
<input aria-label="Bad" class="ad-feedback__emoji-radio-input" id="ad-feedback__0-bad" name="ad" type="radio" value="1"/>
<label class="ad-feedback__emoji-base ad-
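`cnn.div` is shorthand for `cnn.find('div')`: it returns the first matching tag in document order, or `None` when the page has no such tag. A small sketch on invented markup:

```python
from bs4 import BeautifulSoup

html = '<div class="header"><div class="inner">CNN</div></div><div class="footer"></div>'
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div                 # same element as soup.find("div")
print(first_div["class"])

# A tag that does not occur yields None, so guard before using .text.
missing = soup.section
section_text = missing.text if missing is not None else ""
print(repr(section_text))
```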

In [32]:
rows = cnn.find_all('section')
print(rows[:5])

[<section class="layout__info layout-homepage__info" data-editable="topLayout" data-track-zone="topLayout"><div class="alerts" data-uri="cms.cnn.com/_components/alerts/instances/cnn-v1@published"></div>
</section>, <section class="layout__top layout-homepage__top" data-editable="top" data-track-zone="top"></section>, <section class="layout__wrapper layout-homepage__wrapper">
<section class="layout__main layout-homepage__main" data-editable="main" data-reorderable="main" data-track-zone="main"> <div class="section" data-drag-disable="true" data-unselectable="true" data-uri="cms.cnn.com/_components/section/instances/clg34oloy007k47noh5d2dztb@published" role="main">
<section class="body tabcontent active" data-tabcontent="Content">
<div data-editable="items" data-reorderable-component="items">
<div class="scope" data-component-name="scope" data-uri="cms.cnn.com/_components/scope/instances/clg34olor000e47no0vl72kpx@published">
<div data-editable="items" data-reorderable-component="items">


In [33]:
section_text = cnn.section.text  # text of the first <section> on the page
print(section_text)





In [34]:
htmls = soup.find_all('html')
print(htmls[:5])

[<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="Connect with professionals, post jobs, list real estate, and grow your business globally. Join Connection Sphere today." name="description"/>
<!-- Add this in the head section -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>
<!-- Cronitor RUM -->
<script async="" src="https://rum.cronitor.io/script.js"></script>
<script>
    window.cronitor = window.cronitor || function() { (window.cronitor.q = window.cronitor.q || []).push(arguments); };
    cronitor('config', { clientKey: 'ebd61332260a07c66d6285ffce1f457b' });
</script>
<script>
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [['\\(', '\\)']],
        displayMath: [['\\[', '\\]']],
        processEscapes: true,
        processEnvironments: true
    },
    "HTML-CSS": { linebreaks: { automatic: true } 

In [35]:
links = cnn.find_all('link')
print(links)

[<link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>, <link href="//tpc.googlesyndication.com" rel="preconnect"/>, <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>, <link href="//pagead2.googlesyndication.com" rel="preconnect"/>, <link href="//www.googletagservices.com" rel="dns-prefetch"/>, <link href="//www.googletagservices.com" rel="preconnect"/>, <link href="//www.google.com" rel="dns-prefetch"/>, <link href="//www.google.com" rel="preconnect"/>, <link href="//c.amazon-adsystem.com" rel="dns-prefetch"/>, <link href="//c.amazon-adsystem.com" rel="preconnect"/>, <link href="//ib.adnxs.com" rel="dns-prefetch"/>, <link href="//ib.adnxs.com" rel="preconnect"/>, <link href="//cdn.adsafeprotected.com" rel="dns-prefetch"/>, <link href="//cdn.adsafeprotected.com" rel="preconnect"/>, <link href="//securepubads.g.doubleclick.net" rel="dns-prefetch"/>, <link href="//securepubads.g.doubleclick.net" rel="preconnect"/>, <link href="//segment-data-us-east.zqtk.net" rel=
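The `<link>` list above mixes `dns-prefetch` and `preconnect` hints. A sketch of filtering by attribute value: BeautifulSoup parses `rel` as a multi-valued attribute, so the membership test below runs against a list of values. The three tags are copied from the output above.

```python
from bs4 import BeautifulSoup

html = """
<link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
<link href="//tpc.googlesyndication.com" rel="preconnect"/>
<link href="//www.google.com" rel="preconnect"/>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the href of links whose rel values include "preconnect".
preconnect_hosts = [
    link["href"]
    for link in soup.find_all("link")
    if "preconnect" in link.get("rel", [])
]
print(preconnect_hosts)
```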

In [39]:
import requests
import pandas as pd
from datetime import datetime

def scrape_youtube_videos_api(search_query="python programming", max_results=10):
    """
    Scrape YouTube videos using YouTube Data API
    You need to get an API key from Google Cloud Console
    """
    API_KEY = "YOUR_API_KEY"  # Placeholder: replace with your own key and keep real keys out of shared notebooks
    
    # YouTube Data API endpoint
    url = "https://www.googleapis.com/youtube/v3/search"
    
    params = {
        'part': 'snippet',
        'q': search_query,
        'type': 'video',
        'maxResults': max_results,
        'key': API_KEY
    }
    
    try:
        response = requests.get(url, params=params)
        response.raise_for_status()
        
        data = response.json()
        videos_data = []
        
        for item in data.get('items', []):
            video_id = item['id']['videoId']
            snippet = item['snippet']
            
            # Get video statistics
            stats_url = "https://www.googleapis.com/youtube/v3/videos"
            stats_params = {
                'part': 'statistics,snippet,contentDetails',
                'id': video_id,
                'key': API_KEY
            }
            
            stats_response = requests.get(stats_url, params=stats_params)
            stats_response.raise_for_status()  # surface HTTP errors from the statistics call too
            stats_data = stats_response.json()
            
            if stats_data.get('items'):
                video_stats = stats_data['items'][0]
                
                video_info = {
                    'title': snippet.get('title', 'N/A'),
                    'description': snippet.get('description', 'N/A')[:200] + '...',  # Truncate long descriptions
                    'views': video_stats['statistics'].get('viewCount', 'N/A'),
                    'date_posted': snippet.get('publishedAt', 'N/A'),
                    'video_link': f"https://www.youtube.com/watch?v={video_id}",
                    'channel_title': snippet.get('channelTitle', 'N/A')
                }
                videos_data.append(video_info)
        
        return videos_data
        
    except Exception as e:
        print(f"Error: {e}")
        return []

# Usage
if __name__ == "__main__":
    # You need to get an API key from https://console.cloud.google.com/
    # and set it as API_KEY inside scrape_youtube_videos_api()
    
    videos = scrape_youtube_videos_api("python tutorial", 5)
    
    if videos:
        df = pd.DataFrame(videos)
        print(df)
        df.to_csv('youtube_videos.csv', index=False)
        print("Data saved to youtube_videos.csv")
    else:
        print("No data retrieved")

Error: 403 Client Error: Forbidden for url: https://www.googleapis.com/youtube/v3/search?part=snippet&q=python+tutorial&type=video&maxResults=5&key=YOUR_API_KEY
No data retrieved
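A 403 from this endpoint often points at the key or the project's quota rather than at the query itself. Two hedged suggestions, sketched below: read the key from an environment variable (the name `YOUTUBE_API_KEY` is an arbitrary choice here) so it never lands in the notebook, and request statistics for all videos in one call, since the `videos` endpoint accepts a comma-separated list in its `id` parameter. Both helper names are hypothetical, the video IDs are placeholders, and no request is sent.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key from the environment, or None if it is not set."""
    return os.environ.get(env_var)

def build_stats_params(video_ids, api_key):
    """Build query parameters for one batched statistics request."""
    return {
        "part": "statistics,snippet,contentDetails",
        "id": ",".join(video_ids),  # comma-separated IDs in a single call
        "key": api_key,
    }

api_key = load_api_key()
if api_key is None:
    print("Set YOUTUBE_API_KEY before calling the API")
else:
    params = build_stats_params(["VIDEO_ID_1", "VIDEO_ID_2"], api_key)
    print(params["id"])
```

Batching replaces one statistics request per search result with a single request for the whole page of results.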


In [40]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import re

def setup_driver():
    """Setup Chrome driver with options"""
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in background
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def scrape_youtube_search(search_query="python programming", max_videos=5):
    """
    Scrape YouTube search results using Selenium
    Note: This may break if YouTube changes its HTML structure
    """
    driver = setup_driver()
    videos_data = []
    
    try:
        # Navigate to YouTube search; quote_plus escapes spaces and special characters
        from urllib.parse import quote_plus
        search_url = f"https://www.youtube.com/results?search_query={quote_plus(search_query)}"
        driver.get(search_url)
        
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-video-renderer"))
        )
        
        # Find video elements
        video_elements = driver.find_elements(By.CSS_SELECTOR, "ytd-video-renderer")[:max_videos]
        
        for i, video_element in enumerate(video_elements):
            try:
                print(f"Processing video {i+1}...")
                
                # Extract title
                title_element = video_element.find_element(By.CSS_SELECTOR, "#video-title")
                title = title_element.text
                video_link = title_element.get_attribute("href")
                
                # Extract metadata
                metadata_element = video_element.find_element(By.CSS_SELECTOR, "#metadata-line")
                metadata_text = metadata_element.text
                
                # Parse views and date from metadata
                views, date_posted = parse_metadata(metadata_text)
                
                # Get description (requires clicking on video)
                description = get_video_description(driver, video_element)
                
                video_data = {
                    'title': title,
                    'description': description,
                    'views': views,
                    'date_posted': date_posted,
                    'video_link': video_link
                }
                
                videos_data.append(video_data)
                time.sleep(1)  # Be respectful
                
            except Exception as e:
                print(f"Error processing video {i+1}: {e}")
                continue
                
    except Exception as e:
        print(f"Error during scraping: {e}")
    finally:
        driver.quit()
    
    return videos_data

def parse_metadata(metadata_text):
    """Parse views and date from metadata string"""
    parts = metadata_text.split('\n')
    views = parts[0] if len(parts) > 0 else "N/A"
    date_posted = parts[1] if len(parts) > 1 else "N/A"
    return views, date_posted

def get_video_description(driver, video_element):
    """Get video description by clicking on the video"""
    try:
        # Click on video to open description (this is simplified).
        # Navigating away and back can leave the remaining search-result elements stale.
        video_element.click()
        time.sleep(2)
        
        # Try to find description
        description_element = driver.find_element(By.CSS_SELECTOR, "#description")
        description = description_element.text[:200] + "..." if description_element.text else "N/A"
        
        # Go back to search results
        driver.back()
        time.sleep(2)
        
        return description
    except Exception:
        return "Description not available"

# Alternative: Scrape from YouTube trending page
def scrape_youtube_trending(max_videos=10):
    """Scrape videos from YouTube trending page"""
    driver = setup_driver()
    videos_data = []
    
    try:
        driver.get("https://www.youtube.com/feed/trending")
        time.sleep(5)
        
        video_elements = driver.find_elements(By.CSS_SELECTOR, "ytd-video-renderer")[:max_videos]
        
        for video_element in video_elements:
            try:
                title_element = video_element.find_element(By.CSS_SELECTOR, "#video-title")
                title = title_element.text
                video_link = title_element.get_attribute("href")
                
                # Note: Getting detailed info requires more complex interaction
                video_data = {
                    'title': title,
                    'description': 'Use API for full description',
                    'views': 'Use API for accurate views',
                    'date_posted': 'Use API for accurate date',
                    'video_link': video_link
                }
                
                videos_data.append(video_data)
                
            except Exception as e:
                print(f"Error processing video: {e}")
                continue
                
    finally:
        driver.quit()
    
    return videos_data

# Main execution
if __name__ == "__main__":
    print("YouTube Scraper")
    print("1. Using Selenium (limited functionality)")
    print("2. Using API (recommended - requires API key)")
    
    choice = input("Choose method (1 or 2): ")
    
    if choice == "1":
        videos = scrape_youtube_trending(5)
    elif choice == "2":
        # You need to set up API key first
        videos = scrape_youtube_videos_api("python programming", 5)
    else:
        print("Invalid choice")
        videos = []
    
    if videos:
        df = pd.DataFrame(videos)
        print("\nScraped Videos:")
        print(df)
        df.to_csv('youtube_videos.csv', index=False)
        print("\nData saved to youtube_videos.csv")
    else:
        print("No videos were scraped")

YouTube Scraper
1. Using Selenium (limited functionality)
2. Using API (recommended - requires API key)
Choose method (1 or 2): 1
No videos were scraped
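
A quick offline check of the parse_metadata() helper from the scraper above might look like this. It is only a sketch and rests on one assumption: that the metadata text arrives as two lines, views first and date second, which the live page may not guarantee. The variant here also filters out blank lines so that an empty string yields "N/A". The sample strings are invented.

```python
def parse_metadata(metadata_text):
    """Split a two-line metadata string into (views, date_posted)."""
    # Drop empty lines so that a blank input falls back to "N/A"
    parts = [p.strip() for p in metadata_text.split('\n') if p.strip()]
    views = parts[0] if len(parts) > 0 else "N/A"
    date_posted = parts[1] if len(parts) > 1 else "N/A"
    return views, date_posted

# The sample strings below are invented for illustration
full = parse_metadata("1.2M views\n3 days ago")
partial = parse_metadata("1.2M views")
empty = parse_metadata("")
print(full, partial, empty)
```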


In [30]:
#loading the necessary packages
from requests import get # requests.get()
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Your first task will be to download web pages. The requests package comes to the rescue. It aims to be an easy-to-use tool for doing all things HTTP in Python, and it doesn’t disappoint. In this tutorial, you will need only the requests.get() function, but you should definitely check out the full documentation when you want to go further.

First, here’s your function:<br>

In [31]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    raw content (bytes), otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

The simple_get() function accepts a single url argument. It then makes a GET request to that URL. If nothing goes wrong, you end up with the raw HTML content for the page you requested. If there were any problems with your request (like the URL is bad, or the remote server is down), then your function returns None.
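
The decision about what counts as a "good" response can be checked without touching the network. The sketch below uses a variant of is_good_response() that tolerates a missing Content-Type header. The response objects are hypothetical stand-ins built with SimpleNamespace, not real requests.Response instances, and carry only the two attributes the function reads.

```python
from types import SimpleNamespace

def is_good_response(resp):
    """Return True if the response is a 200 whose Content-Type mentions HTML."""
    # .get() with a default avoids a KeyError when the header is absent
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type

# Hypothetical stand-ins for requests.Response objects (only two attributes used)
html_page = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'text/html; charset=UTF-8'})
json_page = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'application/json'})
missing_page = SimpleNamespace(status_code=404,
                               headers={'Content-Type': 'text/html'})
no_header = SimpleNamespace(status_code=200, headers={})

checks = [is_good_response(r)
          for r in (html_page, json_page, missing_page, no_header)]
print(checks)
```

Only the first stand-in passes: the others fail on content type, status code, or a missing header.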

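You may have noticed the use of the closing() function in your definition of simple_get(). The closing() function calls the response’s close() method when the with block exits, even if an exception is raised inside it, so the underlying network connection is always released. This matters here because stream=True keeps the connection open until the content is read or the response is closed. Using closing() like that is good practice and helps to avoid leaked connections.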
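
To see what closing() does in isolation, here is a small sketch. FakeConnection is a hypothetical class invented for this illustration; it stands in for any object with a close() method, such as a streamed response.

```python
from contextlib import closing

class FakeConnection:
    """Hypothetical stand-in for a network response; only close() matters here."""
    def __init__(self):
        self.closed = False

    def read(self):
        # Simulate a failure in the middle of the with block
        raise RuntimeError("simulated network failure")

    def close(self):
        self.closed = True

conn = FakeConnection()
try:
    with closing(conn) as resp:
        resp.read()
except RuntimeError as err:
    print("caught:", err)

# close() was still called, even though read() raised
print(conn.closed)
```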

You can test simple_get() like this:<br>
    

In [32]:
raw_html = simple_get('https://www.connection-sphere.com/en/blogs/883471274311/?lang=en')
len(raw_html)

64438

In [33]:
no_html = simple_get('https://realpython.com/blog/nope-not-gonna-find-it')
no_html is None

True

Once you have raw HTML in front of you, you can start to select and extract. For this purpose, you will be using BeautifulSoup. The BeautifulSoup constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure. The object includes a slew of methods to select, view, and manipulate DOM nodes and text content.

Consider the following quick and contrived example of an HTML document:<br>
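
The example document is not reproduced at this point, so the sketch below uses a minimal stand-in. Its markup, ids, and text are assumptions made for illustration. It shows the two most common ways to pick out elements: select() with a tag name, and select_one() with an id selector.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the example document (not taken from the source)
raw_html = """
<!DOCTYPE html>
<html>
<head><title>Contrived Example</title></head>
<body>
<h2 id="first">Contrived Example</h2>
<p id="eggman">I am the egg man</p>
<p id="walrus">I am the walrus</p>
</body>
</html>
"""

html = BeautifulSoup(raw_html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
paragraphs = html.select('p')
walrus_text = [p.text for p in paragraphs if p.get('id') == 'walrus']
print(walrus_text)

# An id selector narrows the match without a manual filter
eggman = html.select_one('#eggman')
print(eggman.text)
```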
    

Now that you have given the select() method in BeautifulSoup a short test drive, how do you find out what to supply to select()? The fastest way is to step out of Python and into your web browser’s developer tools. You can use your browser to examine the document in some detail. I usually look for id or class element attributes or any other information that uniquely identifies the information I want to extract.

To make matters concrete, turn to the list of mathematicians you saw earlier. If you spend a minute or two looking at this page’s source, you can see that each mathematician’s name appears inside the text content of an `<li>` tag. To make matters even simpler, `<li>` tags on this page seem to contain nothing but names of mathematicians.

Here’s a quick look with Python:<br>
    

In [34]:
#Using BeautifulSoup to Get Mathematician Names from the website
raw_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)

0  Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

1  Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

2  Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

3  Leonhard Euler
 Bernhard Riemann

4  Bernhard Riemann

5  David Hilbert
 Joseph-Louis Lagrange
 Euclid  of Alexandria
 Alexandre Grothendieck
 Gottfried W. Leibniz

6  Joseph-Louis Lagrange
 Euclid  of Alexandria
 Alexandre Grothendieck
 Gottfried W. Leibniz

7  Euclid  of Alexandria
 Alexandre Grothendieck
 Gottfried W. Leibniz

8  Alexandre Grothendieck
 Gottfried W. Leibniz

9  Gottfried W. Leibniz

10  John von Neumann
 Henri Poincaré
 Évariste Galois
 Srinivasa Ramanujan
 Pierre de Fermat

11  Henri Poincaré
 Évariste Galois
 Srinivasa Ramanujan
 Pierre de Fermat

12  Évariste Galois
 Srinivasa Ramanujan
 Pierre de Fermat

13  Srinivasa Ramanujan
 Pierre de Fermat

14  Pierre de Fermat

15  Hermann K. H. Weyl
 Karl W. T. Weierstrass
 Brahmagupta
 Niels Abel
 René Descartes

16  Karl W. T

In [35]:
#Using BeautifulSoup to list the <li> elements of a different page
raw_html = simple_get('https://www.connection-sphere.com/en/blogs/883471274311/?lang=en')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)

0 AI Solutions
1 
CONNECTIONS


Opportunities
Resources
Solutions
Trainings


2 Opportunities
3 Resources
4 Solutions
5 Trainings
6 JOIN US
7 LOGIN
8 
ORGANIZATION


Who We Are
FAQs

More



Send Us Your project
Contact Us




9 Who We Are
10 FAQs
11 
More



Send Us Your project
Contact Us


12 Send Us Your project
13 Contact Us
14 
LANGUAGES




                      English
                  



                      français
                  



15 

                      English
                  

16 

                      français
                  

17 Home
18 Contact
19 About
20 Personal Information You Provide
21 Information Collected Automatically
22 With Your Consent
23 With Service Providers
24 For Legal & Compliance Purposes
25 Platform refers to CSphere’s SaaS solution, including but not limited to advertising, job postings, real estate listings, business networking, online training, and investment facilitation.
26 User refers to any individual or entity accessing or u

In [36]:
#Function to extract single list of names
def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                name = name.strip()
                # Skip blank or whitespace-only lines
                if name:
                    names.add(name)
        return list(names)

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))

The experiment above shows that some of the `<li>` elements contain multiple names separated by newline characters, while others contain just a single name. With this in mind, the get_names() function above splits each element’s text on newlines and collects the names into a single de-duplicated list.
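
One likely reason for the repeated names is that the page leaves its `<li>` tags unclosed, and html.parser then nests each item inside the previous one. The sketch below reproduces that effect on a three-item snippet. The markup is an assumption made for illustration, not a copy of the real page. It then flattens the result the same way get_names() does.

```python
from bs4 import BeautifulSoup

# Assumed markup: list items without closing tags
snippet = "<ol>\n<li>Isaac Newton\n<li>Archimedes\n<li>Carl F. Gauss\n</ol>"

html = BeautifulSoup(snippet, 'html.parser')
items = html.select('li')

# Each <li> also contains the text of the items nested inside it
for i, li in enumerate(items):
    print(i, repr(li.text))

# Split on newlines, strip, and de-duplicate with a set
names = set()
for li in items:
    for name in li.text.split('\n'):
        name = name.strip()
        if name:
            names.add(name)

print(sorted(names))
```

A set discards the original ordering, which is fine here because the names are re-sorted by popularity later.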

In [37]:
#Getting the Popularity Score
def get_hits_on_name(name):
    """
    Accepts a `name` of a mathematician and returns the number
    of hits that mathematician's Wikipedia page received in the 
    last 60 days, as an `int`
    """
    # url_root is a template string that is used to build a URL.
    url_root = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/{}'
    response = simple_get(url_root.format(name))

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')

        # Only consider anchors that actually carry an href attribute
        hit_link = [a for a in html.select('a[href]')
                    if 'latest-60' in a['href']]

        if len(hit_link) > 0:
            # Strip commas
            link_text = hit_link[0].text.replace(',', '')
            try:
                # Convert to integer
                return int(link_text)
            except ValueError:
                log_error("couldn't parse {} as an `int`".format(link_text))

    log_error('No pageviews found for {}'.format(name))
    return None
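
Two small steps inside get_hits_on_name() can be tried offline: building the URL from a name, and turning link text such as "32,000" into a number. The helpers below, build_xtools_url() and parse_pageviews(), are hypothetical names introduced for this sketch. Percent-encoding the name is an assumption about what makes a safe URL, not something the XTools site is known to require.

```python
from urllib.parse import quote

URL_ROOT = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/{}'

def build_xtools_url(name):
    """Hypothetical helper: percent-encode spaces and accents in the name."""
    return URL_ROOT.format(quote(name))

def parse_pageviews(link_text):
    """Convert text such as '32,000' to an int, or return None if not numeric."""
    try:
        return int(link_text.replace(',', '').strip())
    except ValueError:
        return None

url = build_xtools_url('Henri Poincaré')
print(url)
print(parse_pageviews('32,000'))
print(parse_pageviews('n/a'))
```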

You have reached a point where you can finally find out which mathematician is most beloved by the public! The plan is simple:

1. Get a list of names
2. Iterate over the list to get a “popularity score” for each name
3. Finish by sorting the names by popularity
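
The sorting step of the plan can be sketched on its own. The pageview counts below are invented placeholders, not measured values, and -1 marks a name whose lookup failed, following the convention of the script.

```python
# Invented (hits, name) pairs; the counts are placeholders, not real pageviews
results = [
    (32000, 'Henri Poincaré'),
    (-1, 'Archimedes'),        # -1 marks a name whose lookup failed
    (51000, 'Isaac Newton'),
    (18000, 'Niels Abel'),
    (27000, 'David Hilbert'),
    (45000, 'Leonhard Euler'),
    (9000, 'Brahmagupta'),
]

# Tuples compare element by element, so sorting orders by hits first
top_marks = sorted(results, reverse=True)[:5]
no_results = sum(1 for hits, _ in results if hits == -1)

for hits, name in top_marks:
    print('{} with {} pageviews'.format(name, hits))
print('{} lookups failed'.format(no_results))
```

Putting the count first in each tuple is what lets a plain sort work without a key function.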


In [38]:
# Putting all together
if __name__ == '__main__':
    print('Getting the list of names....')
    names = get_names()
    print('... done.\n')

    results = []

    print('Getting stats for each name....')

    for name in names:
        try:
            hits = get_hits_on_name(name)
            if hits is None:
                hits = -1
            results.append((hits, name))
        except Exception:
            results.append((-1, name))
            log_error('error encountered while processing '
                      '{}, skipping'.format(name))

    print('... done.\n')

    # Highest pageview counts first
    results.sort(reverse=True)

    # Slicing is safe even when there are fewer than five results
    top_marks = results[:5]

    print('\nThe most popular mathematicians are:\n')
    for (mark, mathematician) in top_marks:
        print('{} with {} pageviews'.format(mathematician, mark))

    no_results = len([res for res in results if res[0] == -1])
    print('\nBut we did not find results for '
          '{} mathematicians on the list'.format(no_results))

Getting the list of names....
... done.

Getting stats for each name....
No pageviews found for Évariste Galois
No pageviews found for Andrey N. Kolmogorov
No pageviews found for Archimedes
No pageviews found for Godfrey H. Hardy
No pageviews found for Richard Dedekind
No pageviews found for Hermann G. Grassmann
No pageviews found for David Hilbert
No pageviews found for Diophantus  of Alexandria
No pageviews found for Albert Einstein
No pageviews found for Johann Bernoulli
No pageviews found for Alhazen ibn al-Haytham
No pageviews found for F.E.J. Émile Borel
No pageviews found for Muhammed al-Khowârizmi
No pageviews found for Pythagoras  of Samos
No pageviews found for James J. Sylvester
No pageviews found for Élie Cartan
No pageviews found for Niels Abel
No pageviews found for Carl Ludwig Siegel
No pageviews found for L.E.J. Brouwer
No pageviews found for Christiaan Huygens
No pageviews found for Girolamo Cardano
No pageviews found for Bháscara (II) Áchárya
No pageviews found for Ju

In [2]:
# Python 3.x script to scrape course listings from the site and save to CSV
# This script uses requests + BeautifulSoup + pandas for tabular handling

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def fetch_page(url, headers=None, timeout=10):
    """
    Fetch the HTML content of a page and return the response text.
    """
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # raise exception for HTTP errors
    return resp.text

def parse_course_list(html):
    """
    Parse the HTML and extract course details into a list of dicts.
    Finds all course elements and extracts title, link, description, price, and date.
    """
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    # Example selector: adjust these selectors to match the actual site structure
    for card in soup.select("div.course-card"):
        title_elem = card.select_one("h3.course-title")
        link_elem  = card.select_one("a.course-link")
        desc_elem  = card.select_one("p.course-description")
        price_elem = card.select_one("span.course-price")
        date_elem  = card.select_one("span.course-date")
        
        title = title_elem.get_text(strip=True) if title_elem else ""
        link  = link_elem["href"] if link_elem and link_elem.has_attr("href") else ""
        desc  = desc_elem.get_text(strip=True) if desc_elem else ""
        price = price_elem.get_text(strip=True) if price_elem else ""
        date  = date_elem.get_text(strip=True) if date_elem else ""
        
        # Ensure link is full URL if relative
        if link and link.startswith("/"):
            link = requests.compat.urljoin("https://www.connection-sphere.com/en/trainings/courses", link)
        
        courses.append({
            "Title":       title,
            "URL":         link,
            "Description": desc,
            "Price":       price,
            "Date":        date
        })
    return courses

def main():
    # Entry URL for all courses
    base_url = "https://www.connection-sphere.com/en/trainings/courses/"
    print("Fetching page:", base_url)
    html = fetch_page(base_url, headers={"User-Agent":"Mozilla/5.0"})
    
    print("Parsing courses…")
    course_data = parse_course_list(html)
    
    if not course_data:
        print("Warning: no courses found. Check selectors or site layout.")
    else:
        print(f"Found {len(course_data)} courses.")
    
    # Create DataFrame and save to CSV
    df = pd.DataFrame(course_data)
    # Optional: reorder columns
    df = df[["Title", "URL", "Description", "Price", "Date"]]
    output_file = "connection_sphere_courses.csv"
    df.to_csv(output_file, index=False, encoding="utf-8")
    print("Saved CSV to", output_file)

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print("Error during execution:", e)


Fetching page: https://www.connection-sphere.com/en/trainings/courses/
Parsing courses…
Error during execution: "None of [Index(['Title', 'URL', 'Description', 'Price', 'Date'], dtype='object')] are in the [columns]"
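
The KeyError above is what pandas raises when you select columns from a frame that has none, which happens here because no course card matched and the list of dicts was empty. A minimal sketch of one way around it, assuming you want the same five columns whether or not any rows were found, is to pass columns= when building the frame:

```python
import pandas as pd

COLUMNS = ["Title", "URL", "Description", "Price", "Date"]
course_data = []  # what parse_course_list() returns when no selector matches

# Without explicit columns, an empty list yields a frame with no columns at all
bare = pd.DataFrame(course_data)
try:
    bare[COLUMNS]
    selection_failed = False
except KeyError:
    selection_failed = True

# Passing columns= fixes both the set and the order of columns, even when empty
df = pd.DataFrame(course_data, columns=COLUMNS)
print(selection_failed, list(df.columns), len(df))
```

This only removes the crash. An empty result still means the selectors need to be checked against the page's actual markup.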


In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import csv
import os
from urllib.parse import urljoin

def scrape_courses():
    """
    Scrape course data from Connection Sphere website and save to CSV
    """
    # Base URL
    base_url = "https://www.connection-sphere.com"
    courses_url = "https://www.connection-sphere.com/en/trainings/courses/"
    
    # Headers to mimic a real browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    try:
        print("Starting web scraping...")
        
        # Send GET request to the website
        # requests has no default timeout, so set one to avoid hanging forever
        response = requests.get(courses_url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        
        # Parse HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find course elements - you may need to adjust these selectors based on the actual HTML structure
        course_cards = soup.find_all('div', class_=['course-card', 'training-item', 'course-item'])
        
        # If no specific classes found, try to find any divs that might contain course information
        if not course_cards:
            course_cards = soup.find_all('div', class_=lambda x: x and ('course' in x.lower() or 'training' in x.lower()))
        
        courses_data = []
        
        print(f"Found {len(course_cards)} potential course elements")
        
        for index, card in enumerate(course_cards):
            try:
                print(f"Processing course {index + 1}...")
                
                # Extract course information - adjust these selectors based on actual HTML structure
                course_data = {}
                
                # Course Title
                title_elem = card.find(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']) or card.find('a')
                course_data['Title'] = title_elem.get_text(strip=True) if title_elem else "Title not found"
                
                # Course Description
                desc_elem = card.find('p') or card.find('div', class_=lambda x: x and ('description' in x.lower() or 'content' in x.lower()))
                course_data['Description'] = desc_elem.get_text(strip=True) if desc_elem else "Description not found"
                
                # Course Link
                link_elem = card.find('a')
                if link_elem and link_elem.get('href'):
                    course_data['Link'] = urljoin(base_url, link_elem['href'])
                else:
                    course_data['Link'] = "Link not found"
                
                # Course Duration, Level, etc. - these will depend on the actual HTML structure
                # You'll need to inspect the website to find the correct selectors
                
                # Add additional fields as needed
                course_data['Category'] = "To be determined"
                course_data['Duration'] = "To be determined"
                course_data['Level'] = "To be determined"
                course_data['Price'] = "To be determined"
                
                courses_data.append(course_data)
                
                # Add a small delay to be respectful to the server
                time.sleep(0.5)
                
            except Exception as e:
                print(f"Error processing course {index + 1}: {str(e)}")
                continue
        
        # If no courses found with the initial approach, try alternative method
        if not courses_data:
            print("No courses found with initial selectors. Trying alternative approach...")
            courses_data = alternative_scraping_approach(soup, base_url)
        
        return courses_data
        
    except requests.RequestException as e:
        print(f"Error fetching the webpage: {str(e)}")
        return []
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return []

def alternative_scraping_approach(soup, base_url):
    """
    Alternative method to find courses if initial approach fails
    """
    courses_data = []
    
    # Look for any links that might lead to course pages
    course_links = soup.find_all('a', href=True)
    
    for link in course_links:
        href = link['href']
        # Filter for potential course links
        if any(keyword in href.lower() for keyword in ['course', 'training', 'program']):
            try:
                course_data = {}
                course_data['Title'] = link.get_text(strip=True) or "Course Title"
                course_data['Link'] = urljoin(base_url, href)
                course_data['Description'] = "Description to be extracted from detail page"
                course_data['Category'] = "To be determined"
                course_data['Duration'] = "To be determined"
                course_data['Level'] = "To be determined"
                course_data['Price'] = "To be determined"
                
                courses_data.append(course_data)
            except Exception as e:
                print(f"Error processing alternative course: {str(e)}")
                continue
    
    return courses_data

def save_to_csv(courses_data, filename="courses_data.csv"):
    """
    Save course data to CSV file
    """
    if not courses_data:
        print("No data to save.")
        return
    
    try:
        # Create DataFrame
        df = pd.DataFrame(courses_data)
        
        # Ensure we have all expected columns
        expected_columns = ['Title', 'Description', 'Link', 'Category', 'Duration', 'Level', 'Price']
        for col in expected_columns:
            if col not in df.columns:
                df[col] = "Not available"
        
        # Reorder columns
        df = df[expected_columns]
        
        # Save to CSV
        df.to_csv(filename, index=False, encoding='utf-8')
        print(f"Data successfully saved to {filename}")
        print(f"Total courses extracted: {len(courses_data)}")
        
        # Display first few rows
        print("\nFirst few rows of extracted data:")
        print(df.head())
        
    except Exception as e:
        print(f"Error saving to CSV: {str(e)}")

def main():
    """
    Main function to orchestrate the scraping process
    """
    print("Connection Sphere Course Scraper")
    print("=" * 40)
    
    # Scrape course data
    courses_data = scrape_courses()
    
    if courses_data:
        # Save to CSV
        save_to_csv(courses_data)
        
        # Additional information
        print(f"\nScraping completed successfully!")
        print(f"File saved as: courses_data.csv")
    else:
        print("No course data was extracted. The website structure might have changed.")
        print("Please inspect the website manually and update the CSS selectors in the script.")

if __name__ == "__main__":
    main()

Connection Sphere Course Scraper
Starting web scraping...
Found 0 potential course elements
No courses found with initial selectors. Trying alternative approach...
Data successfully saved to courses_data.csv
Total courses extracted: 4

First few rows of extracted data:
                     Title                                   Description  \
0                Trainings  Description to be extracted from detail page   
1          Add New Courses  Description to be extracted from detail page   
2   Self-pace Courses list  Description to be extracted from detail page   
3  Hybrid Courses Calendar  Description to be extracted from detail page   

                                                Link          Category  \
0    https://www.connection-sphere.com/en/trainings/  To be determined   
1  https://www.connection-sphere.com/en/trainings...  To be determined   
2  https://www.connection-sphere.com/en/trainings...  To be determined   
3  https://www.connection-sphere.com/en/trainings... 
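
Both scripts rely on urljoin() to turn the href values found on the page into full links. The sketch below shows how it resolves different kinds of href against the courses URL. The "detail/42" path is hypothetical and used only to show the effect of a trailing slash.

```python
from urllib.parse import urljoin

base = "https://www.connection-sphere.com/en/trainings/courses/"

# Root-relative href: replaces the whole path of the base
a = urljoin(base, "/en/trainings/")

# Relative href (hypothetical path): resolved inside the base "directory"
b = urljoin(base, "detail/42")

# Without the trailing slash, the last segment "courses" is replaced instead
c = urljoin(base.rstrip("/"), "detail/42")

# Absolute href: returned unchanged
d = urljoin(base, "https://example.com/x")

for link in (a, b, c, d):
    print(link)
```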

### 🎯 Practical Legal Checklist
Before scraping any website, ask:

- Is the data public and factual?
- Did I check robots.txt?
- Am I violating the Terms of Service?
- Am I scraping copyrighted content?
- Is there an API available?
- Am I being respectful with request rates?
- What's my intended use of the data?
- Could this harm the website's operations?
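
Two of the questions above, the robots.txt check and the request rate, can be partly automated with the standard library. The sketch below parses a made-up robots.txt from a string so that it runs offline. The rules, the example.com URLs, and the "tutorial-bot" user agent are all assumptions for illustration. A real scraper would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would fetch <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "tutorial-bot"  # hypothetical user-agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """Return True if the rules above permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)

def polite_delay(default=1.0):
    """Seconds to wait between requests: the Crawl-delay if given, else default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default

for url in ["https://example.com/public/page.html",
            "https://example.com/private/data.html"]:
    print(url, "allowed" if allowed(url) else "disallowed")

print("wait", polite_delay(), "seconds between requests")
```

Passing the value of polite_delay() to time.sleep() between requests keeps the crawl within the rate the site asks for. robots.txt expresses the site's wishes only; it does not settle the legal questions in the rest of the checklist.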

### 🔄 Safer Alternatives
When in doubt, use:

- Official APIs (always preferred)
- Public datasets (government portals, Kaggle)
- Data licensing (pay for commercial data)
- Data partnerships (formal agreements)

### 📞 When to Seek Legal Advice
Consult a lawyer if you are:

- Planning commercial use of scraped data
- Scraping competitors' sites
- Handling personal or sensitive data
- Running large-scale scraping operations
- Uncertain about legal boundaries

Bottom Line:
Web scraping is generally on safe ground when it:

- Targets public data
- Serves educational purposes
- Respects the website's resources
- Does not violate the terms of service

It can become unlawful when it involves:

- Bypassing paywalls or authentication
- Copying copyrighted content
- Violating privacy laws
- Harming website operations

When in doubt, err on the side of caution and seek permission or use official APIs!
