# Web Scraping of Largest Companies by Revenue

This notebook demonstrates how to scrape data from the Wikipedia page on the largest companies by revenue: [List of largest companies by revenue](https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue). We will process the data to handle complex HTML structures, such as tables with merged cells (`rowspan`), and extract the relevant information into a clean tabular format.

## **1. Imports and Initial Setup**
In the first step, we import the necessary libraries: `BeautifulSoup` for parsing HTML and `requests` for making the HTTP request to fetch the page content.


In [1]:
from bs4 import BeautifulSoup as bs
import requests

## **2. Fetching the Web Page**
We fetch the content of the Wikipedia page using `requests` and parse it with BeautifulSoup.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
header = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=header, timeout=10)
soup = bs(page.content, 'html.parser')

- `url`: The target page URL.
- `header`: A `User-Agent` header to simulate a browser request.
- `requests.get`: Fetches the page content.
- `BeautifulSoup`: Parses the HTML content for easy navigation.
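
The request above does not check whether it succeeded. A minimal sketch of one way to add that check follows; the helper name `fetch_html` is hypothetical and not part of the original notebook. It assumes that returning `None` on failure is acceptable for the caller.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails.

    Hypothetical helper, not part of the original notebook.
    """
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors
        print(f'Request failed: {exc}')
        return None
    return response.text
```

The returned text could then be passed to `BeautifulSoup` in place of `page.content`.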

## **3. Parsing the Table and Handling Rowspan**
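The most complex part of this task is handling the `rowspan` attribute. A cell with `rowspan="n"` appears only once in the HTML but visually covers `n` rows, so the rows below it contain one cell fewer than the header. Parsed naively, the remaining cells in those rows shift left into the wrong columns. We handle this by tracking every active rowspan and copying its value into each row it covers.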
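
To make the idea concrete, here is a small sketch of the same fill-down technique on plain Python data instead of HTML. Each cell is a `(value, rowspan)` tuple. The function name `expand_rowspans` and the company names are invented for illustration.

```python
def expand_rowspans(rows, n_cols):
    """Expand (value, rowspan) cells into a full grid with n_cols columns."""
    pending = {}  # column index -> (value, rows still to fill)
    grid = []
    for cells in rows:
        out = [None] * n_cols
        # Carry down values from cells that still span into this row
        for col, (value, remaining) in list(pending.items()):
            out[col] = value
            if remaining == 1:
                del pending[col]
            else:
                pending[col] = (value, remaining - 1)
        cursor = 0
        for value, rowspan in cells:
            # Skip columns already occupied by a carried-down cell
            while cursor < n_cols and out[cursor] is not None:
                cursor += 1
            if cursor >= n_cols:
                break
            out[cursor] = value
            if rowspan > 1:
                pending[cursor] = (value, rowspan - 1)
            cursor += 1
        grid.append(out)
    return grid

# Made-up sample: 'Retail' spans the first two rows
rows = [
    [('1', 1), ('Acme', 1), ('Retail', 2)],
    [('2', 1), ('Globex', 1)],
    [('3', 1), ('Initech', 1), ('Energy', 1)],
]
for line in expand_rowspans(rows, 3):
    print(line)
```

The second row has only two cells of its own, yet it comes out with `'Retail'` in the third column.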



### Extracting Table Data

In [3]:
table = soup.find_all('table', class_='wikitable')[0]
rows = table.find_all('tr')

headers = [header.get_text(strip=True) for header in rows[0].find_all('th')]
max_cols = len(headers)

- `soup.find_all`: Finds all tables with the `wikitable` class; `[0]` selects the first one.
- `rows[0]`: The header row, from which the column names are extracted.
- `max_cols`: The number of columns, taken from the header row.
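
The extracted header names can carry footnote markers, as in `Headquarters[note 1]` in the printed output of this notebook. If cleaner column names are wanted, one possible approach is to strip bracketed markers with a regular expression. The helper `clean_header` is hypothetical, and the sketch assumes no header needs to keep text in square brackets.

```python
import re

def clean_header(text):
    """Drop bracketed footnote markers such as '[note 1]' from a header."""
    return re.sub(r'\[[^\]]*\]', '', text).strip()

raw_headers = ['Rank', 'Name', 'Headquarters[note 1]', 'Ref.']
print([clean_header(h) for h in raw_headers])
```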

### Handling Rowspan

In [4]:
countries = []  # one list per table row (each row is a company)
rowspans = []   # cells that still span into upcoming rows

# Skip the first two rows (rows[0] holds the headers)
for row in rows[2:]:
    tmp_data = [None] * max_cols  # prefill with None
    actual_cols = row.find_all(['th', 'td'])
    col_cursor = 0

    # Fill in values carried down from active rowspans
    for data in rowspans:
        if data['total_rows'] >= 0:
            tmp_data[data['col_index']] = data['value']

    for col in actual_cols:
        # Skip columns already filled by a rowspan
        while col_cursor < max_cols and tmp_data[col_cursor] is not None:
            col_cursor += 1
        if col_cursor >= max_cols:
            break  # more cells than header columns; ignore the extras

        value = col.get('data-sort-value', col.get_text(strip=True))
        tmp_data[col_cursor] = value

        if col.has_attr('rowspan'):
            rowspans.append({
                'col_index': col_cursor,
                'total_rows': int(col['rowspan']) - 1,
                'value': value
            })

        col_cursor += 1

    # Update rowspans: decrement total_rows and drop finished entries
    for data in rowspans[:]:
        data['total_rows'] -= 1
        if data['total_rows'] < 0:
            rowspans.remove(data)

    countries.append(tmp_data)

- `rowspans`: A list to store the state of active rowspans.
- `tmp_data`: A temporary list to hold the data for each row, with `None` as placeholders for missing values.
- `col_cursor`: Tracks the current column index to prevent overwriting cells.
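
One side effect of `get_text(strip=True)` is that text from several child elements is joined without a separator, which is the likely cause of values such as `Retailinformation technology` in the output. The sketch below shows the difference when a separator is passed. The HTML fragment is an assumption made for illustration, not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup with two pieces of text in one <td>
html = '<td><a>Retail</a><br/><a>information technology</a></td>'
cell = BeautifulSoup(html, 'html.parser').td

# No separator: the pieces are concatenated directly
print(cell.get_text(strip=True))

# With a separator: the pieces stay distinguishable
print(cell.get_text(', ', strip=True))
```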

### Final Table Extraction
We print the headers and the extracted rows for inspection.

In [5]:
print(headers)
for country in countries:
    print(country)

['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 'Headquarters[note 1]', 'State-owned', 'Ref.']
['1', 'Walmart', 'Retail', '$680,985', '$19,436', '2,100,000', 'United States', 'No', '[1]']
['2', 'Amazon', 'Retailinformation technology', '$637,959', '$59,248', '1,556,000', 'United States', 'No', '[4]']
['3', 'State Grid Corporation of China', 'Electricity', '$545,948', '$9,204', '1,361,423', 'China', 'Yes', '[5]']
['4', 'Saudi Aramco', 'Oil and gas', '$480,446', '$106,246', '73,311', 'Saudi Arabia', 'Yes', '[6]']
['5', 'China Petrochemical Corporation', 'Oil and gas', '$429,700', '$9,393', '513,434', 'China', 'Yes', '[7]']
['6', 'China National Petroleum Corporation', 'Oil and gas', '$476,000', '$25,250', '1,026,301', 'China', 'Yes', '[8]']
['7', 'UnitedHealth Group', 'Healthcare', '$400,278', '$14,405', '400,000', 'United States', 'No', '[9]']
['8', 'Apple', 'Information technology', '$391,035', '$93,736', '164,000', 'United States', 'No', '[10]']
['9', 'Berkshire Hathawa

## **4. Converting Data to DataFrame**
Finally, we convert the cleaned data into a pandas DataFrame for easier manipulation and analysis.

In [6]:
import pandas as pd

df = pd.DataFrame(countries, columns=headers)
df

| | Rank | Name | Industry | Revenue | Profit | Employees | Headquarters[note 1] | State-owned | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Walmart | Retail | $680,985 | $19,436 | 2100000 | United States | No | [1] |
| 1 | 2 | Amazon | Retailinformation technology | $637,959 | $59,248 | 1556000 | United States | No | [4] |
| 2 | 3 | State Grid Corporation of China | Electricity | $545,948 | $9,204 | 1361423 | China | Yes | [5] |
| 3 | 4 | Saudi Aramco | Oil and gas | $480,446 | $106,246 | 73311 | Saudi Arabia | Yes | [6] |
| 4 | 5 | China Petrochemical Corporation | Oil and gas | $429,700 | $9,393 | 513434 | China | Yes | [7] |
| 5 | 6 | China National Petroleum Corporation | Oil and gas | $476,000 | $25,250 | 1026301 | China | Yes | [8] |
| 6 | 7 | UnitedHealth Group | Healthcare | $400,278 | $14,405 | 400000 | United States | No | [9] |
| 7 | 8 | Apple | Information technology | $391,035 | $93,736 | 164000 | United States | No | [10] |
| 8 | 9 | Berkshire Hathaway | Financials | $371,433 | $88,995 | 392400 | United States | No | [11] |
| 9 | 10 | CVS Health | Healthcare | $357,776 | $8,344 | 259500 | United States | No | [12] |


- `pandas`: A powerful data analysis library that allows easy handling of tabular data.
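- `DataFrame`: Builds a table from the `countries` list, where each inner list is one company row (despite the variable's name) and `headers` supplies the column names.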
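
Values such as `$680,985` are stored as strings, so they cannot be summed or sorted numerically as they are. A minimal sketch of one way to convert them follows. It uses a small stand-in frame built from two rows of the printed output, and it assumes every value in these columns contains only digits, `$` and `,`.

```python
import pandas as pd

# Stand-in for the scraped DataFrame, using two rows from the output above
sample = pd.DataFrame(
    [['Walmart', '$680,985', '2,100,000'],
     ['Amazon', '$637,959', '1,556,000']],
    columns=['Name', 'Revenue', 'Employees'],
)

for column in ['Revenue', 'Employees']:
    # Remove currency symbols and thousands separators, then convert
    cleaned = sample[column].str.replace(r'[$,]', '', regex=True)
    sample[column] = pd.to_numeric(cleaned)

print(sample.dtypes)
```

After the conversion, the columns can be used in numeric operations such as sorting or summing.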

## **5. Conclusion**
This notebook demonstrates how to scrape and process a complex table from a Wikipedia page, handling merged cells (`rowspan`) and presenting the data in a clean format using pandas. The final output is a DataFrame containing the largest companies by revenue, ready for analysis or further processing.

### Possible Improvements:
- Add error handling (e.g., for network failures or parsing errors).
- Extend the notebook to scrape additional pages or handle other table structures.
- Add visualizations for better insights (e.g., plotting the top companies by revenue).