# WEB SCRAPING
---
## Write a Python program to scrape the following data:

- ### Scrape all the text from any web page.

- ### Scrape only a particular div from any web page.

- ### Scrape all the tables from any web page.
---
## Installing the **'requests'** and **'beautifulsoup4'** libraries, which are used for making HTTP requests and parsing HTML and XML documents, respectively.

In [1]:
%pip install requests
%pip install beautifulsoup4



## Importing the **'requests'** library, which is used for making HTTP requests in Python.
- ### Importing the **'BeautifulSoup'** class from the **bs4** library. **BeautifulSoup** is a Python library for pulling data out of **HTML** and **XML** files. It creates a parse tree for parsed pages that can be used to **extract data from HTML**, which is useful for **web scraping**.
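
As a minimal sketch of what "parse tree" means in practice, the snippet below parses a small inline HTML string and reads a few values from it. The markup and the names `sample_html` and `sample_soup` are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body>
</html>
"""

# Build the parse tree with Python's built-in HTML parser.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string               # text inside the <title> tag
first_paragraph = sample_soup.find("p").get_text()  # text of the first <p> tag
paragraph_count = len(sample_soup.find_all("p"))    # number of <p> tags in the tree

print(page_title)
print(first_paragraph)
print(paragraph_count)
```

The same calls (`find`, `find_all`, `get_text`) are the ones used on the real page in the cells that follow.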

In [2]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe'  # URL of the page to scrape

div_id = 'mw-content-text'  # id of the div that is extracted later on

response = requests.get(url)  # Send a GET request to the URL and store the result in the
                              # response variable.
soup = BeautifulSoup(response.content, 'html.parser')  # Parse the HTML content of the response into a
                                                       # BeautifulSoup object that can be used to extract data.
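
The cell above assumes the request succeeds. A slightly more defensive variant might look like the sketch below. The helper name `fetch_soup`, its `session` parameter, and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(page_url, timeout=10, session=None):
    """Download page_url and return it as a BeautifulSoup object (hypothetical helper)."""
    # Use the given session if there is one, otherwise the requests module itself;
    # both provide a get(url, timeout=...) callable.
    http = session if session is not None else requests
    reply = http.get(page_url, timeout=timeout)  # give up if the server does not answer in time
    reply.raise_for_status()                     # raise requests.HTTPError on a 4xx/5xx status
    return BeautifulSoup(reply.content, "html.parser")
```

Calling `fetch_soup(url)` inside a `try` block and catching `requests.RequestException` would cover timeouts, connection failures, and HTTP error statuses in one place.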

## The cell below extracts all the text from the parsed HTML content and prints it to the console:
- ### **all_text = soup.get_text()** : This line extracts all the text content from the **'soup'** object, which represents the parsed HTML structure of the web page, and assigns it to the variable **'all_text'**. The text is returned exactly as it appears in the markup, so the printed output includes navigation labels and long runs of blank lines.

In [3]:
all_text = soup.get_text()
print(all_text)





Marvel Cinematic Universe - Wikipedia



































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search






















Appearance
















Create account

Log in








Personal tools





 Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1
Development




Toggle Development subsection





1.1
Marvel Studios films and series






1.1.1
The Infinity Saga films








1.1.2
The Multiverse Saga films and series










1.2
Marvel Television series








1.3
Expansion to other media








1.4
Business practices










2
Feature films




Toggle Feature films subsection





2.1
The Infinity S
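
The output above contains long runs of blank lines because `get_text()` keeps the whitespace between tags. One possible way to tidy it up, sketched here on a small made-up HTML string, is to drop `<script>` and `<style>` tags and pass `separator` and `strip` to `get_text()`. The helper name `visible_text` and the sample markup are assumptions for illustration. Whether script and style text shows up in `get_text()` at all can depend on the Beautiful Soup version, so removing those tags explicitly is the cautious choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with extra whitespace, a <style> and a <script> tag.
sample_html = """
<html>
  <head>
    <title>Sample page</title>
    <style>p { color: red; }</style>
    <script>console.log("not visible text");</script>
  </head>
  <body>
    <h1>Heading</h1>


    <p>First paragraph.</p>


    <p>Second paragraph.</p>
  </body>
</html>
"""


def visible_text(html):
    """Return the page text with one non-empty string per line (hypothetical helper)."""
    parsed = BeautifulSoup(html, "html.parser")
    # Remove tags whose content is code or styling rather than readable text.
    for tag in parsed.find_all(["script", "style"]):
        tag.decompose()
    # strip=True trims each text fragment and skips empty ones;
    # separator="\n" puts every remaining fragment on its own line.
    return parsed.get_text(separator="\n", strip=True)


clean_text = visible_text(sample_html)
print(clean_text)
```

On the notebook's page, `soup.get_text(separator="\n", strip=True)` would apply the same clean-up to the existing `soup` object.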

## Finding the **'div'** with the id **'mw-content-text'**, printing the text of all the paragraph and heading tags within it, and printing every row of every table it contains.
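
Before running this on the full Wikipedia page, the same idea can be tried on a few lines of inline HTML. The sketch below is only an illustration: the function name `extract_div`, the sample markup, and the choice to return lists instead of printing are all assumptions, not the notebook's own code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one div to ignore and one target div with a table.
sample_html = """
<div id="sidebar"><p>Ignored text.</p></div>
<div id="mw-content-text">
  <h2>Films</h2>
  <p>Some intro text.</p>
  <table>
    <tr><th>Year</th><th>Title</th></tr>
    <tr><td>2001</td><td> Example Film </td></tr>
  </table>
</div>
"""


def extract_div(html, target_id):
    """Return (texts, tables) found inside the div with the given id (hypothetical helper)."""
    parsed = BeautifulSoup(html, "html.parser")
    target = parsed.find("div", id=target_id)
    if target is None:  # no div with that id on the page
        return [], []

    # Text of every paragraph and heading inside the div, in document order.
    texts = [el.get_text(strip=True) for el in target.find_all(["p", "h1", "h2", "h3"])]

    # Each table becomes a list of rows; each row is a list of cell strings.
    tables = []
    for table in target.find_all("table"):
        rows = []
        for row in table.find_all("tr"):
            cells = row.find_all(["th", "td"])
            rows.append([cell.get_text(strip=True) for cell in cells])
        tables.append(rows)
    return texts, tables


sample_texts, sample_tables = extract_div(sample_html, "mw-content-text")
print(sample_texts)
print(sample_tables)
```

Returning the data rather than printing it keeps the result available for later steps, such as writing the rows to a CSV file.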

In [5]:
specific_div = soup.find('div', {'id': div_id})  # Find the first div in the soup object
                                                 # with the id 'mw-content-text'.

if specific_div:  # Proceed only if a div with the specified id was found.
    some_text = specific_div.find_all(['p', 'h1', 'h2', 'h3'])  # Collect all paragraph (<p>) and heading
                                                                # (<h1>, <h2>, <h3>) tags inside the div
                                                                # in a list called some_text.

    for element in some_text:  # Iterate through each element in 'some_text' and print
        print(element.get_text())  # its text content.

    # The tables are collected once, after the text loop has finished.
    tables = specific_div.find_all('table')  # Find all the <table> tags inside the div (using the
                                             # 'find_all' method) and return them as a list.

    for index, table in enumerate(tables, start=1):  # Iterate through the tables list,
                                                     # numbering the tables starting from 1.
        rows = table.find_all('tr')  # Find all the table rows (<tr> elements) within the table.
        print(f"Table{index}:")  # Print "Table" followed by the current table's index number.

        for row in rows:  # Iterate through each row (<tr>) in the table.
            columns = row.find_all(['th', 'td'])  # Find all header (<th>) and data (<td>) cells
                                                  # within the row.
            row_data = [col.get_text().strip() for col in columns]  # Extract the text of each cell and
                                                                    # remove leading/trailing whitespace.
            print(row_data)

else:
    print("Div not found")

Streaming output truncated to the last 5000 lines.
['2011', '', 'Iron Man 2[141][139]']
['The Incredible Hulk[141]']
['A Funny Thing...[141][83]']
['Thor[141]']
['The Consultant[141][83]']
['2012', '', 'The Avengers[142]']
['Item 47[116]']
['2013', '', 'The Dark World[142]']
['Iron Man 3[139][143]']
['2014', '', 'All Hail the King[118]']
['The Winter Soldier[139][143]']
['Guardians of the Galaxy[144]']
['I Am Groot ep. 1[145]']
['Guardians of the Galaxy Vol. 2[146]']
['I Am Groot eps. 2–10[145][147]']
['2015', '', 'Daredevil season 1[148][149][150]']
['Jessica Jones season 1[148][149][150]']
['Age of Ultron[139]']
['Ant-Man[139][151]']
['Daredevil season 2[149][150]']
['Luke Cage season 1[148][152][150]']
['2016', '', 'Iron Fist season 1[148][150]']
['The Defenders[148][150]']
['Civil War[139][153]']
['Black Widow[154]']
['Black Panther[155]']
['Homecoming[156]']
['The Punisher season 1[148][150]']
['2016–2017', '', 'Doctor Strange[157][158]']
['2017', '', 'Jessica Jones 