# Web Scraping with Python: Unleashing the Power of Data Extraction

##### Writer: [Farzad Asgari](https://github.com/farzadasgari)

Welcome to the world of web scraping with Python! In this exciting journey, we'll delve into the art of extracting valuable data from the vast ocean of the internet. Web scraping, also known as web harvesting, is the automated process of gathering information from websites. Imagine a scenario where you stumble upon a goldmine of data on a website, but there's no convenient download option. This is where web scraping comes to the rescue!

By leveraging the power of Python, you can bypass manual data collection and build a powerful toolset for programmatic data extraction. This skill proves invaluable for data scientists, analysts, and anyone who works with large datasets.

**Why Web Scraping with Python?**

Python offers a compelling advantage for web scraping due to its:

* **Readability:** Python's syntax is known for its clarity and ease of learning. This makes it ideal for building and understanding web scraping scripts, even for beginners.
* **Rich Ecosystem of Libraries:** Python boasts a vast collection of libraries specifically designed for web scraping tasks. Libraries like `requests` and `Beautiful Soup` simplify the process of fetching website content and parsing its structure.
* **Versatility:** Python's capabilities extend beyond web scraping. It can seamlessly integrate with data analysis tools like pandas and NumPy, allowing you to clean, manipulate, and analyze the extracted data efficiently. 

**Real-World Applications of Web Scraping: A Powerful Tool for Data Acquisition**

Web scraping has emerged as a versatile technique for extracting valuable data from the vast ocean of information on the web. It empowers businesses, researchers, and individuals to gather targeted information that fuels various applications. Here are some compelling examples that illustrate the power of web scraping:

* **E-commerce Price Monitoring:** Imagine being able to track the prices of your desired products across different online stores, all in one place. Web scraping can automate this process by extracting product information and prices from various e-commerce websites. This allows you to identify the best deals and make informed purchasing decisions. 

   * **Benefits:** Save money by finding the most competitive prices, track price trends to predict future fluctuations, and receive alerts when prices drop on your desired items.
   * **Example:** A company can scrape product data from major retailers to build a price comparison tool for their customers.

* **Market Research:** Gaining deep insights into customer trends, competitor strategies, and industry landscapes is crucial for businesses of all sizes. Web scraping can be a valuable tool in this regard. By scraping relevant data from industry websites, social media platforms, and news articles, businesses can:

   * **Benefits:** Understand customer preferences and buying habits, analyze competitor offerings and pricing strategies, identify emerging trends within the industry, and gather data to support informed business decisions.
   * **Example:** A marketing team can scrape data on social media platforms to understand customer sentiment towards their brand and competitor brands. 

* **Social Media Analysis:** The power of social media lies in the vast amount of user-generated content. Web scraping allows you to tap into this valuable data source for social media analysis. By scraping data from social media platforms, you can:

   * **Benefits:** Monitor brand sentiment and identify areas for improvement, analyze user behavior and preferences to tailor marketing strategies, track trending topics and hashtags to stay ahead of the curve, and gather data to measure the effectiveness of social media campaigns.
   * **Example:** A social media influencer can scrape data on audience demographics and engagement metrics to refine their content strategy.

* **News Aggregation:** Staying informed about current events and industry news is essential. Web scraping can help you create a personalized news feed by automatically collecting articles from various sources. You can customize it to focus on specific topics or  industries that interest you.

   * **Benefits:** Stay up-to-date on the latest news and trends, create a consolidated feed from your preferred sources, and avoid information overload by filtering content based on your interests.
   * **Example:** A financial analyst can scrape news articles from financial websites to build a personalized feed of market news and company announcements.

* **Building Datasets:** Machine learning and data analysis projects often require large datasets for training models and generating insights. Web scraping can be employed to extract data from relevant websites to build datasets for various purposes.

   * **Benefits:** Create custom datasets tailored to your specific needs, gather data that might not be readily available through traditional methods, and supplement existing datasets to enhance their richness and usefulness.
   * **Example:** A researcher can scrape data on weather patterns from historical weather websites to build a dataset for climate change analysis.

**Important Considerations:**

- **Website Terms of Service:** Always check the website's terms of service before scraping data. Some websites might explicitly prohibit scraping activities.
- **Data Quality and Ethics:** Ensure the scraped data is accurate and relevant to your needs. Respect ethical considerations by scraping responsibly and avoiding overloading website servers.

Web scraping, when used responsibly and ethically, offers a powerful tool for data acquisition across diverse fields. It empowers users to gather valuable information, gain insights, and make informed decisions. 

**Ethical Considerations and Responsible Scraping:**

It's crucial to practice web scraping responsibly. Here are some key points to remember:

* **Respect robots.txt:** Websites often have a robots.txt file specifying which pages or content can be scraped. Always adhere to these guidelines.
* **Avoid overloading servers:** Be mindful of the frequency of your scraping requests. Don't overwhelm a website's server with excessive traffic.
* **Extract only relevant data:** Focus on extracting the data you need and avoid scraping unnecessary information.
* **Use proper attribution:** If you plan to publish scraped data, ensure you provide proper attribution to the source website.
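
As a rough sketch of the first two points, Python's standard library includes `urllib.robotparser`, which can read a robots.txt file and report whether a given URL may be fetched. The robots.txt content, the `my-scraper` user agent, and the `example.com` URLs below are made up for illustration. Against a real site you would typically call `set_url()` and `read()` to download the file instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer permission queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_robot_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our (hypothetical) user agent may fetch each URL.
print(parser.can_fetch("my-scraper", "https://example.com/index.html"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))

# The requested pause between requests, if the site declares one.
print(parser.crawl_delay("my-scraper"))
```

A scraper could skip any URL for which `can_fetch()` returns `False` and sleep for the reported crawl delay between requests.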

**What's Next?**

Now that we've explored the fundamentals of web scraping with Python, it's time to dive deeper! In the following sections of this notebook, we'll embark on a hands-on learning experience. We'll explore how to set up our Python environment, install essential libraries, and write our first web scraping script. We'll delve into techniques for parsing HTML content, extracting specific data points, and handling potential challenges. Finally, we'll explore best practices for storing and analyzing the extracted data.

Get ready to unleash the power of web scraping with Python and unlock the potential of data hidden within the vast web!


## The Powerhouse of Web Scraping: Requests Library

The Requests library is an essential tool in the web scraper's arsenal. It streamlines the process of making HTTP requests, the foundation of any web scraping workflow. Whenever you visit a website, your browser sends an HTTP request to the server, retrieving the data that populates the web page. Requests replicates this process programmatically in Python.

<center><img src="https://requests.readthedocs.io/en/latest/_static/requests-sidebar.png" width="200px"/></center>

**Beyond Built-in Modules: Why Requests Shines**

While Python's standard library offers `urllib.request` for HTTP requests (the separate `urllib2` module existed only in Python 2), Requests provides a more user-friendly alternative. Here's why developers favor Requests:

* **Simplified API:** Requests boasts a clean and intuitive API, making it easier to write and understand code compared to the more verbose `urllib.request` interface.
* **Reduced Boilerplate Code:** Requests eliminates the need for extensive code to perform basic HTTP requests. It handles common tasks like setting headers, sending data, and receiving responses with minimal effort.
* **Improved Readability:** Requests promotes cleaner code by separating concerns. You focus on specifying the request details (URL, method, headers, etc.), while Requests takes care of the underlying complexities of HTTP communication.
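
To make the "reduced boilerplate" point more concrete, here is a small sketch that describes a GET request with query parameters and a custom header. It uses `requests.Request(...).prepare()` so that nothing is actually sent over the network. The URL, the query values, and the `my-scraper/0.1` User-Agent string are placeholders, not part of the original notebook.

```python
import requests

# Describe the request: Requests encodes the query string and stores the headers.
request = requests.Request(
    method="GET",
    url="https://example.com/search",
    params={"q": "web scraping", "page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
)

# prepare() builds the exact request that would be sent, without sending it.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

In everyday use, `requests.get(url, params=..., headers=..., timeout=10)` builds and sends the same kind of request in a single call.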

**Requests: A Stepping Stone in the Scraping Journey**

While Requests excels at fetching website data, it's just the first step in building a robust web scraping tool. Additional components are needed for a complete "web scraping spider":

* **Scheduling and Parallelization:** You'll likely need custom logic to control the frequency and order of requests to avoid overloading servers.
* **Data Parsing:** Libraries like BeautifulSoup come into play later to parse the retrieved HTML content, extract specific data points, and navigate website structures.
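
One simple form of the scheduling logic mentioned above is a fixed pause between consecutive requests. The sketch below is only an illustration: `fetch_politely` and `fake_fetch` are hypothetical helper names, and `fake_fetch` stands in for a real download so the example runs without network access.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, result) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        yield url, fetch(url)


def fake_fetch(url: str) -> str:
    """Stand-in for a real download (hypothetical, for demonstration only)."""
    return f"<html>content of {url}</html>"


pages = list(
    fetch_politely(
        ["https://example.com/a", "https://example.com/b"],
        fetch=fake_fetch,
        delay_seconds=0.1,
    )
)
for url, html in pages:
    print(url, len(html))
```

In a real scraper, `fetch` could be a function that calls `requests.get()` and returns the response text.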

**Summary:**

The Requests library simplifies the act of making HTTP requests, a fundamental step in web scraping. Its user-friendly API and streamlined approach make it the preferred choice for developers compared to built-in Python modules. Remember, Requests is a powerful tool, but it's just the beginning of your web scraping adventure!

## Unveiling the Power of BeautifulSoup for Web Scraping

BeautifulSoup takes center stage when it comes to parsing data extracted from websites. Unlike Requests, which focuses on retrieving web pages, BeautifulSoup specializes in extracting valuable information from the raw HTML or XML content.

<center><img src="https://www.crummy.com/software/BeautifulSoup/10.1.jpg" width="200px"/></center>

**The Perfect Partnership: Requests and BeautifulSoup**

Since BeautifulSoup can't directly fetch web pages, it often works hand-in-hand with the Requests library. Here's the winning combination:

1. **Requests takes the lead:** It initiates the HTTP request, retrieving the target web page.
2. **BeautifulSoup steps in:** Once the HTML content is obtained, BeautifulSoup parses it, allowing you to extract specific data points.
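
The two steps above can be sketched as a pair of small functions. The function names and the sample HTML are assumptions made for this illustration. Only the parsing step is executed here, so the example runs without a network connection.

```python
from typing import List

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Step 1: Requests downloads the page and returns its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_link_targets(html: str) -> List[str]:
    """Step 2: BeautifulSoup parses the HTML and collects every <a href=...> value."""
    soup = BeautifulSoup(html, "html.parser")
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


# Hypothetical HTML used in place of a downloaded page.
sample_html = """
<html><body>
  <a href="/wiki/Python">Python</a>
  <a href="https://example.com/">Example</a>
  <a>No target</a>
</body></html>
"""
print(extract_link_targets(sample_html))
```

With network access, the two steps chain together as `extract_link_targets(fetch_html(url))`.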

**Simplicity is Beauty: The Advantages of BeautifulSoup**

BeautifulSoup shines in its user-friendly nature and its ability to streamline repetitive tasks during data parsing. Here's what makes it a favorite among web scrapers:

* **Intuitive API:** BeautifulSoup provides a clean and straightforward API. With just a few lines of code, you can navigate the parsed HTML document and locate the data you need, making complex parsing tasks more manageable.
* **Automation Powerhouse:** BeautifulSoup automates repetitive aspects of data parsing. You can easily find all instances of specific data points (like links) within a document, saving you time and effort.
* **Encoding Savior:** BeautifulSoup attempts to detect the document's encoding automatically and converts the content to Unicode, so special characters are less likely to come out garbled.

**Summary:**

BeautifulSoup is a powerful tool for parsing website data. Its user-friendliness and automation capabilities make it an ideal companion to the Requests library. Together, they form a formidable team for your web scraping endeavors. Remember, Requests retrieves the content, while BeautifulSoup helps you make sense of it!

## Scrapy: The All-in-One Powerhouse for Web Scraping

Scrapy emerges as the champion in the web scraping arena. It's an open-source Python framework specifically designed to streamline the web scraping process. Developed by Scrapinghub co-founders, Scrapy takes the complexity out of building and configuring web scraping tools, aptly nicknamed "spiders." 

<center><img src="https://scrapy.org/img/scrapylogo.png" width="200px"/></center>

**Beyond Individual Libraries: Embrace the Framework**

Imagine a solution that eliminates the need to juggle multiple libraries! Scrapy offers a comprehensive toolkit, handling a significant portion of the work for you. It seamlessly addresses common challenges and edge cases that you might not have anticipated.

**Get Scraping in Minutes: Effortless Efficiency**

Within minutes of installation, you can have a fully functional spider up and running, scraping data from the web. Scrapy streamlines the process by:

* Downloading HTML content.
* Parsing and processing the extracted data.
* Saving the data in user-friendly formats like CSV, JSON, or XML.

**Built-in Features for Robust Scraping**

Scrapy doesn't stop there. It provides a rich set of built-in extensions and middlewares designed to handle:

* **Cookies and Sessions:** Manage user sessions and authentication seamlessly.
* **HTTP Features:** Leverage features like compression, authentication, caching, user-agents, robots.txt adherence, and crawl depth restriction for responsible scraping.
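
Many of these features are switched on through a project's `settings.py` file. The excerpt below is a sketch for a hypothetical project; the setting names are standard Scrapy settings, while the specific values and the User-Agent string are assumptions chosen for illustration.

```python
# settings.py (excerpt) for a hypothetical Scrapy project

ROBOTSTXT_OBEY = True  # respect each site's robots.txt rules
USER_AGENT = "my-scraper/0.1 (+https://example.com/contact)"  # placeholder identity
DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # adapt the request rate to server response times
DEPTH_LIMIT = 3  # do not follow links more than three levels deep
HTTPCACHE_ENABLED = True  # cache responses locally to avoid repeat downloads
COOKIES_ENABLED = True  # keep cookies between requests (session handling)
```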

**Customization Made Easy: Extend Your Reach**

Scrapy empowers you to tailor your scraping project with ease. You can:

* Develop custom middlewares to handle specific functionalities not covered by built-in features.
* Create custom pipelines to define how the extracted data is processed, stored, or exported.

**Speed Boost with Asynchronous Magic**

One of Scrapy's biggest advantages lies in its foundation – Twisted, an asynchronous networking library. This translates to significant performance gains:

* **Parallel Requests:** Scrapy spiders can make multiple HTTP requests simultaneously, not waiting for each response before initiating the next.
* **Agile Data Parsing:** Data parsing can begin as soon as the server starts returning data, further enhancing efficiency.

**Addressing the JavaScript Challenge**

While Scrapy doesn't natively handle JavaScript execution, the team at Scrapinghub has created a solution: Splash. This lightweight, scriptable headless browser seamlessly integrates with Scrapy, enabling you to tackle websites that rely heavily on JavaScript.

**Learning Curve with Benefits**

Admittedly, Scrapy's learning curve is slightly steeper compared to using a library like BeautifulSoup. However, the rewards outweigh the initial effort. Here's what you gain:

* **Excellent Documentation:** Comprehensive documentation provides a clear path to mastering Scrapy.
* **Active Community:** A vibrant community of developers on GitHub and StackOverflow offer support and contribute new plugins, making Scrapy a constantly evolving framework.

**Summary:**

Scrapy simplifies the web scraping journey, offering a robust and efficient framework. While it requires a bit more initial investment in learning, the long-term benefits outweigh the effort. With its built-in features, customization options, and asynchronous processing, Scrapy empowers you to build powerful and scalable web scrapers.  

## Selenium: The Power Behind Scraping Complex Websites

Selenium stands out from the crowd as a browser automation tool that was not originally designed for scraping but offers unique capabilities. Its primary purpose is automated testing of web applications, which it does by driving a real web browser.

<center><img src="https://selenium-python.readthedocs.io/_static/logo.png" width="200px"/></center>

**Why is Selenium Valuable for Web Scraping?**

Modern websites often rely heavily on JavaScript to dynamically update content after the initial page load. Traditional web scraping tools may miss this dynamic data because they don't execute JavaScript. Selenium solves this challenge.

**The Edge of Selenium: Executing JavaScript for Complete Data**

When a Selenium-powered "spider" visits a page, the browser it controls runs the page's JavaScript, so the rendered content becomes available for parsing. Content that loads asynchronously may still require an explicit wait. This unlocks a significant advantage:

* **Access to JavaScript-Generated Data:** You can scrape data that wouldn't be accessible without JavaScript or a full browser. This allows for a more comprehensive extraction of the website's content.

**The Trade-off: Speed vs. Depth**

Selenium's power comes at a cost: speed. Executing all JavaScript on every page can be significantly slower compared to a simple HTTP request. Here's a breakdown:

* **Ideal for Selective Scraping:** If speed isn't a major concern, and the scale of your scraping project is manageable, Selenium can be a good choice.
* **Not for Large-Scale Projects:** When speed becomes critical, or you're dealing with large-scale scraping, executing JavaScript on every page becomes impractical. A more sophisticated approach is necessary.

**Beyond Individual Libraries: The Power of Frameworks**

As you've seen, each Python library tackles a specific aspect of web scraping. Building a fully functional web scraping tool often requires combining multiple libraries. Fortunately, there's a simpler solution:

**Introducing Scrapy: A Framework for Effortless Scraping**

Scrapy offers a purpose-built web scraping framework. It provides all the core components needed to create a web scraper "out of the box." Additionally, Scrapy boasts a vast collection of plugins designed to handle various web scraping complexities.

**Summary:**

Selenium empowers you to scrape data hidden behind JavaScript, but at a cost of speed. If simplicity and efficiency are your priorities, consider exploring Scrapy, a comprehensive web scraping framework. Remember, the choice of tool depends on your specific needs and the complexity of the website you're targeting.


---

## Responsible Web Scraping with Requests & BeautifulSoup
Web scraping can be a valuable tool, but it's crucial to scrape responsibly. Here's what to keep in mind:

**Respecting Website Policies:**

* **Check robots.txt:** Many websites have a robots.txt file specifying allowed and disallowed content for scraping. Always adhere to these guidelines.
* **Avoid overwhelming servers:** Be mindful of the frequency and volume of your requests. Don't overload a website's server with excessive traffic.
* **Extract only relevant data:** Focus on extracting the data you need and avoid scraping unnecessary information.
* **Use proper attribution:** If you plan to publish scraped data, ensure you provide proper attribution to the source website.

**Understanding Website Structure:**

Before scraping, take some time to analyze the target website's structure and layout. This helps you identify the specific HTML elements containing the data you want to extract. Let's explore how to use Requests and BeautifulSoup for responsible scraping, starting with a practice example using Wikipedia.

### Example: Extracting Data from Wikipedia

If you're working in a Jupyter Notebook environment, you can install libraries directly from a notebook cell. The commented-out cell below installs `beautifulsoup4` and `requests` from within Jupyter.

Running the Code:

1. **Highlight the code cells:** Select the cells containing the installation commands.

2. **Uncomment the cells:** Remove the hash symbol `#` at the beginning of the line.

3. **Run the cells:** Once uncommented, use the `run` button or keyboard shortcut (often `Shift+Enter`) to execute the code.



In [1]:
# !pip install beautifulsoup4
# !pip install requests


**Verification:**

After running the cells, you can try importing the libraries in another cell to verify the installation.


If there are no errors, the libraries are installed and ready to use!

In [2]:
import requests
from bs4 import BeautifulSoup

print("BeautifulSoup and Requests installed successfully!")

BeautifulSoup and Requests installed successfully!


**Alternative Installation (Outside Jupyter):**

If you prefer installing libraries outside the Jupyter Notebook environment, you can open a terminal and use the same commands:

<div class="alert alert-block" style="margin-left:20px; margin-right:20px; background-color: #3d3d3d; color: #fff">
    \$ pip install beautifulsoup4 <br>
    \$ pip install requests
</div>

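This installs the libraries into whichever Python environment is active in that terminal. If Jupyter runs from a different environment (for example, a separate virtual environment), install them there as well.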


**Importing Essential Tools**

In [3]:
import requests  # Makes HTTP requests to web pages
import bs4  # Parses HTML and XML content
import pandas as pd  # Used for data manipulation and creating DataFrames

**Defining Our Target: The Wikipedia Page**

Before we begin extracting data, we need to tell our code where to look. This involves specifying the URL (Uniform Resource Locator) of the web page we want to scrape. In this example, let's use the Wikipedia page on Artificial Intelligence:

In [4]:
url = 'https://en.wikipedia.org/wiki/Artificial_intelligence'

In [5]:
data_page = requests.get(url)

Running this code sends an HTTP GET request and stores the server's reply in `data_page`, a `Response` object. The cell itself prints nothing; the page's HTML is available through the object's attributes.
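
The call above assumes the request succeeds. A slightly more defensive variant, sketched below, sets a timeout, identifies the scraper with a User-Agent header, and raises an exception if the server answers with an error status. The helper name `fetch_page` and the User-Agent string are assumptions for this example.

```python
import requests


def fetch_page(url: str, timeout_seconds: float = 10.0) -> requests.Response:
    """Download a page, failing loudly on timeouts and HTTP error statuses."""
    headers = {"User-Agent": "my-scraper/0.1 (learning project)"}  # placeholder identity
    response = requests.get(url, headers=headers, timeout=timeout_seconds)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response


# Example usage (performs a real network request, so it is left commented out):
# data_page = fetch_page("https://en.wikipedia.org/wiki/Artificial_intelligence")
# print(data_page.status_code, data_page.encoding, len(data_page.text))
```

Without a `timeout`, a request to an unresponsive server can wait indefinitely, which is why passing one is generally recommended.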

**Inspecting the Retrieved Data:**

Let's take a peek at what we've retrieved. The `data_page.text` property provides access to the raw HTML content of the webpage as a string:

In [6]:
data_page.text



**Understanding the Transformation: Converting to BeautifulSoup**

While the raw HTML content holds the information we need, it's not in a very user-friendly format. Enter `BeautifulSoup`! This library transforms the raw HTML content into a structured object that allows us to navigate the website's elements and extract data more easily:

In [7]:
data_page_soup = bs4.BeautifulSoup(data_page.text, 'html.parser')

In [8]:
data_page_soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Artificial intelligence - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vecto

**Unveiling the Power of BeautifulSoup**

The `BeautifulSoup` object we've created is like a treasure map, guiding us to the valuable data hidden within the HTML content. With this powerful tool in hand, we can embark on exciting data extraction adventures!

Let's explore some of the amazing things we can accomplish:

**Extracting the Page Title**

Now that we have a BeautifulSoup object representing the webpage's structure, let's unveil the title! Websites typically use the `<title>` element to specify the page's title, and BeautifulSoup offers powerful methods to locate these elements.

1. **Targeting the Title Element:** 
   We use the `select` method with the CSS selector `"title"` to find all elements matching this criterion within our BeautifulSoup object (`data_page_soup`). `select` always returns a list; here it should contain a single element, the page's title element.

In [9]:
title = data_page_soup.select('title')

In [10]:
title

[<title>Artificial intelligence - Wikipedia</title>]

2. **Grabbing the Text:** 
   Assuming there's only one title (common for webpages), we access the first element in the list using `title[0]`. Then, we use the `getText()` method to extract the actual text content from the title element.

In [12]:
title_text = title[0].getText()

3. **Revealing the Title:** 
   Displaying `title_text` reveals the title of the Wikipedia page we scraped, in this case "Artificial intelligence - Wikipedia".


In [13]:
title_text

'Artificial intelligence - Wikipedia'

**Additional Notes:**

* You can check the length of the `title` list using `len(title)` to verify that there's only one title element.
* If you encounter websites with multiple title elements (less common), you might need to adjust your code to handle those scenarios.
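
As a small illustration of these notes, the sketch below counts the matches and then uses `select_one()`, which returns the first match or `None` instead of a list. The sample HTML is a made-up stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used in place of a downloaded page.
sample_html = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# select() always returns a list, so its length shows how many titles exist.
matches = soup.select("title")
print(len(matches))

# select_one() returns the first match, or None when nothing matches.
first_title = soup.select_one("title")
if first_title is not None:
    print(first_title.get_text())

# The title is also reachable as an attribute of the soup object.
print(soup.title.string)

# A selector with no matches yields None rather than raising an error.
print(soup.select_one("h1"))
```

Checking for `None` before calling `get_text()` avoids an `AttributeError` on pages that lack the element.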

**Identifying the Headline Class**

Before we extract the title headlines from the Wikipedia page, we need to determine the specific class associated with these elements. Here's how we'll approach this task:

1. **Inspecting the Webpage:** The key lies in examining the HTML structure of the Wikipedia page. We can use our browser's developer tools to delve into the code and identify the class name applied to the title headline elements.

   - **Opening Developer Tools:** Right-click anywhere on the Wikipedia page and select "Inspect" or "Inspect Element" (the option might vary slightly depending on your browser). This will open the developer tools window.

   - **Locating the Headlines:** Navigate to the section of the webpage where the title headlines reside. Use the developer tools to pinpoint these elements. You can hover over elements on the page to see their corresponding HTML code in the developer tools window.

   - **Identifying the Class:** Look for the `class` attribute within the HTML code for the title headline elements. Its value is the class name (e.g., `"mw-headline"`).

2. **Refining the Code (Optional):**

   - Once you've identified the correct class name, replace `<class_name>` in the following code snippet with the actual class name you discovered on the Wikipedia page (the leading dot tells the selector to match by class):

   ```python
   titles_class = data_page_soup.select('.<class_name>')  # Replace '<class_name>' with the actual class name
   ```

3. **Understanding the Output:**

   - After running the code with the correct class name, printing `titles_class` will display a list of elements. Each element in this list represents a title headline on the Wikipedia page because they all share the identified class name.

**Moving Forward:**

With the class name in hand, we can use BeautifulSoup's methods to extract the text content of each title headline or perform further processing based on our specific needs. The next steps select the elements into `titles_class` and then iterate through them to extract the desired data.

In [14]:
titles_class = data_page_soup.select('.mw-headline')

In [15]:
titles_class

[<span class="mw-headline" id="Goals">Goals</span>,
 <span class="mw-headline" id="Reasoning_and_problem-solving">Reasoning and problem-solving</span>,
 <span class="mw-headline" id="Knowledge_representation">Knowledge representation</span>,
 <span class="mw-headline" id="Planning_and_decision-making">Planning and decision-making</span>,
 <span class="mw-headline" id="Learning">Learning</span>,
 <span class="mw-headline" id="Natural_language_processing">Natural language processing</span>,
 <span class="mw-headline" id="Perception">Perception</span>,
 <span class="mw-headline" id="Social_intelligence">Social intelligence</span>,
 <span class="mw-headline" id="General_intelligence">General intelligence</span>,
 <span class="mw-headline" id="Techniques">Techniques</span>,
 <span class="mw-headline" id="Search_and_optimization">Search and optimization</span>,
 <span class="mw-headline" id="State_space_search">State space search</span>,
 <span class="mw-headline" id="Local_search">Local sea


**Extracting Headline Text**

Now that we have a list of elements (`titles_class`) containing the title headlines, let's extract the actual text content of each headline.

1. **Iterating Through Headlines:** The `for` loop iterates through each element in the `titles_class` list, which represents the title headline elements.
2. **Accessing Text Content:** Inside the loop, for each element (`headline`), we use the `.text` property to extract the text content. This is the text displayed within the title headline element on the webpage. The loop variable is named `headline` so that it does not overwrite the `title` list created earlier.

In [16]:
for headline in titles_class:
    print(headline.text)

Goals
Reasoning and problem-solving
Knowledge representation
Planning and decision-making
Learning
Natural language processing
Perception
Social intelligence
General intelligence
Techniques
Search and optimization
State space search
Local search
Logic
Probabilistic methods for uncertain reasoning
Classifiers and statistical learning methods
Artificial neural networks
Deep learning
GPT
Specialized hardware and software
Applications
Health and medicine
Games
Finance
Military
Generative AI
Other industry-specific tasks
Ethics
Risks and harm
Privacy and copyright
Misinformation
Algorithmic bias and fairness
Lack of transparency
Bad actors and weaponized AI
Reliance on industry giants
Technological unemployment
Existential risk
Ethical machines and alignment
Open source
Frameworks
Regulation
History
Philosophy
Defining artificial intelligence
Evaluating approaches to AI
Symbolic AI and its limits
Neat vs. scruffy
Soft vs. hard computing
Narrow vs. general AI
Machine consciousness, sentience

**Difference Between `.getText()` and `.text`:**

In BeautifulSoup, `.text` and `.getText()` return the same result: the text content of an element and all of its descendants, with the tags removed. Here's a breakdown:

   - **`.text`:** A property that is equivalent to calling `.get_text()` with its default arguments. It is the most concise option for basic text extraction.
   - **`.getText()`:** An alias of `.get_text()`. Calling the method form lets you pass optional arguments: `separator` (a string inserted between text fragments) and `strip` (when `True`, trims whitespace around each fragment and drops empty fragments).

**Additional Considerations:**

- You can modify the code within the loop to perform further processing on the extracted text content, such as filtering for specific keywords or storing the text in a list for later use.
- If the extracted text contains unwanted whitespace, try `.get_text(strip=True)`, which trims whitespace around each text fragment. The `.string` attribute is stricter: it returns the text only when the element has exactly one child, and `None` when it has several.
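
The differences between these options are easiest to see side by side. The sketch below runs them on a small, made-up HTML fragment (not taken from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment with extra whitespace and a nested tag.
sample_html = """
<div id="card">
  <h3>  Deep learning  </h3>
  <p>Uses <b>neural</b> networks.</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
heading = soup.select_one("h3")
paragraph = soup.select_one("p")

# .text keeps the surrounding whitespace; strip=True removes it.
print(repr(heading.text))
print(repr(heading.get_text(strip=True)))

# get_text() joins the text of nested tags; separator controls how.
print(paragraph.get_text())
print(paragraph.get_text(separator=" | ", strip=True))

# .string works for a single child string, but is None for mixed content.
print(repr(heading.string))
print(paragraph.string)
```

Here `heading.string` returns the heading text because `<h3>` has a single child, while `paragraph.string` is `None` because `<p>` contains both text and a `<b>` tag.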

**Alternative Approach: Targeting Titles with Spans**

While we previously identified a class containing the title headlines, there might be other ways to achieve the same result. Here's an alternative approach using HTML structure:

1. **Targeting Specific Structure:** This code snippet uses a CSS selector to target elements. The selector `'h3 > span'` selects all `<span>` elements that are direct children (`>`) of `<h3>` (heading level 3) elements.

   - The assumption here is that the title headlines on the Wikipedia page use `<h3>` elements with child `<span>` elements containing the actual text content. You might need to adjust the `<h3>` tag depending on the actual HTML structure of the page you're scraping.


In [17]:
titles_class_spans = data_page_soup.select('h3 > span')  # You can adjust the h tag (h2, h3, etc.) based on the actual structure


2. **Understanding the Output:**

   - Running this code stores a list of elements in `titles_class_spans`; evaluate the variable in a new cell to display it. Each element represents a `<span>` element that's a direct child of an `<h3>` element, potentially containing the title text.

**Comparison to Previous Approach:**

- The previous approach relied on identifying a specific class name associated with the title headlines. 
- This alternative approach leverages the HTML structure by targeting specific tags and their relationships.

**Choosing the Right Approach:**

The best approach depends on the structure of the webpage you're scraping. If there's a unique class consistently applied to title headlines, using that class might be more efficient. However, if the class name is subject to change or there's no clear class, targeting the structure with tags can be a reliable option.

**Additional Considerations:**

- It's always recommended to inspect the webpage's HTML structure to ensure the chosen selector targets the desired elements.
- You can experiment with different selectors (e.g., using other heading tags like `<h2>`) to see which one works best for the specific webpage.
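One quick way to experiment is to count how many elements each candidate selector matches before committing to one. This is a sketch on invented HTML; `candidate_selectors` and `match_counts` are hypothetical names, and on the real page you would call `select` on `data_page_soup` instead of `sample_soup`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page
sample_html = """
<h2 class="title">Overview</h2>
<h3><span>History</span></h3>
<h3><span>Applications</span></h3>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Selectors we want to compare (illustrative choices)
candidate_selectors = ["h2 > span", "h3 > span", "h2.title"]

# Count the matches for each selector
match_counts = {
    selector: len(sample_soup.select(selector))
    for selector in candidate_selectors
}

for selector, count in match_counts.items():
    print(f"{selector!r}: {count} match(es)")
```

A selector that matches zero elements, or far more than the number of headings visible on the page, is a sign that the assumed structure is wrong.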


**Extracting Links from the Webpage**

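So far, we've focused on extracting title headlines. But what if we want to grab the links on the Wikipedia page? BeautifulSoup can target those as well. Note that the examples below select `<link>` elements, which reference external resources; clickable hyperlinks in the page body use the `<a>` tag and can be selected in the same way.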

**Selecting Links with `select()`:**

In [18]:
links = data_page_soup.select('link')

**Explanation:**

   - This line calls the `select` method with the CSS selector `"link"`, which matches every `<link>` element on the webpage.
   - `<link>` tags are typically used to link external resources like stylesheets or icons to the webpage.

**Checking the Results:**

- Evaluating `links` on its own in a notebook cell displays the full list of matched `<link>` elements, which can be long and hard to scan.

In [19]:
links

[<link href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cext.wikimediamessages.styles%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cskins.vector.search.codex.styles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022" rel="stylesheet"/>,
 <link href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022" rel="stylesheet"/>,
 <link href="//upload.wikimedia.org" rel="preconnect"/>,
 <link href="//en.m.wikipedia.org/wiki/Artificial_intelligence" media="only screen and (max-width: 640px)" rel="alternate"/>,
 <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>,
 <link href="/static/favicon/wikipedia.ico" rel="icon"/>,
 <link href="/w/rest.php/v1/search" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>,
 <link href="//en.wikipedia.org/w/api.php?action=rsd" rel="EditURI" type="application

**Finding All Links with `find_all()`:**

In [20]:
for link in data_page_soup.find_all('link'):
    print(link)

<link href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cext.wikimediamessages.styles%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cskins.vector.search.codex.styles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022" rel="stylesheet"/>
<link href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022" rel="stylesheet"/>
<link href="//upload.wikimedia.org" rel="preconnect"/>
<link href="//en.m.wikipedia.org/wiki/Artificial_intelligence" media="only screen and (max-width: 640px)" rel="alternate"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="icon"/>
<link href="/w/rest.php/v1/search" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>
<link href="//en.wikipedia.org/w/api.php?action=rsd" rel="EditURI" type="application/rsd+xml"/>
<li

**Explanation:**

   - The `find_all` method is a versatile way to locate all elements matching a specific tag. Here, it finds all occurrences of the `<link>` tag.
   - The `for` loop iterates through each element (`link`) in the list returned by `find_all`.
   - Inside the loop, printing `link` will display the complete HTML code for each `<link>` element found on the page.

**Understanding the Output:**

- By running this loop, you'll see the complete HTML code for each `<link>` element, including its attributes like the `href` attribute, which might contain the URL of the linked resource.

**Key Differences:**

- `select` takes a CSS selector, which makes it convenient for expressing relationships and attribute conditions in one string (for example, `<link>` elements with a particular `rel` value, or `<span>` elements inside `<h3>` headings).
- `find_all` filters by tag name, attributes, text, or a custom function passed as arguments. For a bare tag name such as `'link'`, it returns the same elements as `select('link')`.

**Choosing the Right Method:**

The choice between `select` and `find_all` is mostly a matter of how you prefer to describe the elements you want. If the target is easiest to express as a CSS selector (a tag with a specific class, or a nested structure), `select` is a natural fit. If you simply want every occurrence of a tag, or want to filter with Python arguments, `find_all` works well.
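The following sketch compares the two methods side by side on a tiny, invented `<head>` fragment. The names `sample_soup`, `stylesheets_select`, and `stylesheets_find_all` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <link> elements
sample_html = """
<head>
  <link rel="stylesheet" href="/styles/site.css">
  <link rel="icon" href="/favicon.ico">
</head>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# For a bare tag name, both methods return the same elements
all_with_select = sample_soup.select("link")
all_with_find_all = sample_soup.find_all("link")

# Narrowing by attribute: CSS syntax vs. keyword arguments
stylesheets_select = sample_soup.select('link[rel="stylesheet"]')
stylesheets_find_all = sample_soup.find_all("link", rel="stylesheet")

print(len(all_with_select), len(all_with_find_all))
print(stylesheets_select[0]["href"], stylesheets_find_all[0]["href"])
```

Both calls return list-like results, so the rest of the code (looping, indexing, reading attributes) is the same whichever method you pick.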

**Additional Considerations:**

- You might need to further process the extracted links to obtain the actual URLs. This often involves accessing the `href` attribute within the `<link>` element.
- There might be different types of links on a webpage (internal vs. external). You can explore filtering techniques to refine your selection based on specific criteria.
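As a rough sketch of both considerations, the code below reads the `href` attribute, converts it to an absolute URL, and separates internal from external hyperlinks by comparing host names. The HTML is invented, `base_url` is assumed to be the address of the scraped page, and the names `internal_links`, `external_links`, and `resource_urls` are hypothetical.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Assumed address of the scraped page (adjust to the URL you actually fetched)
base_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Hypothetical markup mixing a resource link and hyperlinks
sample_html = """
<link rel="icon" href="/static/favicon/wikipedia.ico">
<a href="/wiki/Machine_learning">Machine learning</a>
<a href="https://example.org/paper">External paper</a>
<a name="anchor-without-href">No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
base_host = urlparse(base_url).netloc

internal_links, external_links = [], []
# 'a[href]' skips anchors that have no href attribute
for anchor in sample_soup.select("a[href]"):
    absolute = urljoin(base_url, anchor["href"])
    if urlparse(absolute).netloc == base_host:
        internal_links.append(absolute)
    else:
        external_links.append(absolute)

# Resource links (<link>) expose their URL through href as well
resource_urls = [
    urljoin(base_url, tag["href"])
    for tag in sample_soup.find_all("link", href=True)
]

print(internal_links)
print(external_links)
print(resource_urls)
```

Comparing host names is a simple heuristic; sites that spread content across several subdomains may need a looser rule.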

---

**Extracting and Joining Image URLs**

Now that we have a BeautifulSoup object representing the webpage's structure, let's embark on a mission to extract the URLs of all images present on the Wikipedia page. Here's a breakdown of the code and its functionality:

1. **Finding Image Elements:**

   - The code begins with `data_page_soup.find_all('img')`, which locates every `<img>` element on the page. (Older code often uses `findAll`, a legacy alias for the same method; `find_all` is the current spelling.) The `<img>` tag is used to embed images in webpages.

   - The result of `find_all` is a list-like `ResultSet` containing all the `<img>` elements found on the page. Each element in this list represents a single image.

In [21]:
images = data_page_soup.find_all('img')


2. **Building the Image URL List:**

   - We create an empty list named `images_urls` to store the extracted image URLs.

   - The code then iterates through the list of image elements (`images`) using a `for` loop.

   - Inside the loop, for each `image` element, we access its `'src'` attribute using `image['src']`. The `'src'` attribute within an `<img>` element typically specifies the URL of the image file.

   - The `urljoin` function from `urllib.parse` comes into play here. It takes two arguments: the base URL (the URL of the webpage we scraped) and the relative URL extracted from the `'src'` attribute. The `urljoin` function intelligently combines these parts to form the complete image URL.

   - The combined image URL is then appended to the `images_urls` list using the `append` method. This process continues for each image element in the loop, building a list containing all the extracted image URLs.

In [22]:
from urllib.parse import urljoin

In [23]:
images_urls = []
for image in images:
    images_urls.append(urljoin(url, image['src']))

3. **Printing the Image URLs:**

   - The final `for` loop iterates through the `images_urls` list, which now holds all the extracted image URLs.

   - Inside the loop, for each `image_url`, we simply print it using `print`. This will display each image URL on a separate line, allowing you to see the URLs of all images found on the webpage.

In [24]:
for image_url in images_urls:
    print(image_url)

https://en.wikipedia.org/static/images/icons/wikipedia.png
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg
https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg
https://upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/Dall-e_3_%28jan_%2724%29_artificial_intelligence_icon.png/100px-Dall-e_3_%28jan_%2724%29_artificial_intelligence_icon.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/General_Formal_Ontology.svg/260px-General_Formal_Ontology.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Kismet-IMG_6007-gradient.jpg/220px-Kismet-IMG_6007-gradient.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/Gradient_descent.gif/220px-Gradient_descent.gif
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/SimpleBayesNet.svg/380px-SimpleBayesNet.svg.png
https://upload.wikimed

**Key Points:**

- This code snippet demonstrates how to extract image URLs from a webpage using BeautifulSoup and the `urljoin` function.
- It leverages BeautifulSoup's ability to locate specific elements and access their attributes.
- The `urljoin` function ensures that relative URLs are correctly combined with the base URL to form complete image URLs.
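To see how `urljoin` treats the different forms a `src` value can take, here is a small standalone sketch. The `base_url` is assumed to be the address of the scraped page, and the relative paths are illustrative examples rather than values taken from it.

```python
from urllib.parse import urljoin

# Assumed address of the scraped page
base_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Root-relative path: replaces the whole path of the base URL
root_relative = urljoin(base_url, "/static/favicon/wikipedia.ico")

# Protocol-relative URL: keeps the scheme, switches the host
protocol_relative = urljoin(base_url, "//upload.wikimedia.org/a.png")

# Plain relative path: resolved against the base URL's directory
plain_relative = urljoin(base_url, "Machine_learning")

# Absolute URL: returned unchanged
already_absolute = urljoin(base_url, "https://example.org/x.png")

print(root_relative)      # https://en.wikipedia.org/static/favicon/wikipedia.ico
print(protocol_relative)  # https://upload.wikimedia.org/a.png
print(plain_relative)     # https://en.wikipedia.org/wiki/Machine_learning
print(already_absolute)   # https://example.org/x.png
```

The protocol-relative case explains why some of the image URLs printed above point to `upload.wikimedia.org` rather than `en.wikipedia.org`.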

**Additional Considerations:**

- Depending on the complexity of the website, image URLs might be constructed differently. You might need to adjust the code to handle variations in how image sources are specified.
- Not every `<img>` element has a `src` attribute, and `image['src']` raises a `KeyError` when it is missing. Use `image.get('src')` or explicit error handling to skip such elements gracefully.
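A defensive version of the extraction loop might look like the sketch below. It runs on invented HTML so it can be tried in isolation; `page_url` is assumed to be the address of the scraped page, and `image_urls` is an illustrative name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the scraped page
page_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Hypothetical markup, including images with a missing or empty src
sample_html = """
<img src="/static/images/icons/wikipedia.png">
<img src="//upload.wikimedia.org/wikipedia/commons/a/a3/Gradient_descent.gif">
<img alt="placeholder without a src attribute">
<img src="">
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

image_urls = []
for image in sample_soup.find_all("img"):
    # .get() returns None instead of raising KeyError when src is absent
    src = image.get("src")
    if not src:
        continue  # skip images with a missing or empty src
    image_urls.append(urljoin(page_url, src))

for image_url in image_urls:
    print(image_url)
```

Some sites put the real image address in other attributes (lazy-loading setups, for instance), so inspecting the page's HTML remains the best way to confirm which attribute to read.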