<a href="https://colab.research.google.com/github/Wezz-git/AI-samples/blob/main/Web_Scraping%20(S%26P%20500).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**The Business Problem:**

You're a junior data scientist at a hedge fund. Your boss says, "I need a list of all the companies in the S&P 500, their stock tickers, and what industry they're in. Find it." There's no API for this. The data is just sitting on a Wikipedia page.

**This is Web Scraping:**

You're going to write a script that "reads" a website's HTML source code and pulls the data out.
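To make "reading the HTML" concrete, here is a minimal sketch that uses only Python's standard library on a tiny, made-up table. `SAMPLE_HTML` and `CellCollector` are hypothetical names introduced for illustration; they are not part of the notebook, which relies on BeautifulSoup to hide this kind of bookkeeping.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML table standing in for a real page (illustrative only).
SAMPLE_HTML = """
<table id="demo">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = CellCollector()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

The point is only that a scraper walks the tags and keeps the text it cares about; a real page is far messier than this sample.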

In [2]:
# Tools:
# requests: You already know this! We'll use it to download the webpage's raw HTML.
# BeautifulSoup: This is the new library. It's a "parser" that makes it easy to navigate the messy HTML and find the exact table we need.

In [3]:
# The PyPI package is named beautifulsoup4 (it is imported as bs4).
# 'beautifulSoup' is the obsolete 3.x release and fails to install on Python 3.
!pip install beautifulsoup4
!pip install requests


In [21]:
import requests

url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Without a browser-like User-Agent the server may answer 403 Forbidden,
# so we identify ourselves the way a normal browser would
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)

# Check .status_code on the response we already have.
# 200 means the request succeeded.
print(f"Status Code: {response.status_code}")

Status Code: 200
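Printing the status code is fine for a quick check, but a script usually wants to stop as soon as the download fails. The sketch below is one possible way to do that, not the notebook's own method: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies on `raise_for_status()` and the `timeout` argument, both of which are part of `requests`.

```python
import requests


def fetch_html(url, user_agent, timeout=10.0, session=None):
    """Download a page and return its HTML text.

    Hypothetical helper for illustration. Raises requests.HTTPError
    when the server answers with a 4xx or 5xx status (for example 403).
    """
    http = session if session is not None else requests.Session()
    response = http.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()  # turns an error status into an exception
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html(url, headers["User-Agent"])
```

Passing an optional `session` keeps the helper easy to test without touching the network.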


In [11]:
# 1 - import the new tool

from bs4 import BeautifulSoup

# 2 - create the 'soup'
# This parses the raw HTML text we downloaded

soup = BeautifulSoup(response.text, 'html.parser')

print(f"Status code: {response.status_code}")

print("Successfully created 'soup' object. HTML is parsed and ready to be searched.")

Status code: 200
Successfully created 'soup' object. HTML is parsed and ready to be searched.


In [12]:
# --- 1. Find the specific table ---
# We're telling BeautifulSoup to find a 'table' tag
# that has an id attribute of 'constituents'

table = soup.find('table', id='constituents')

# --- 2. Check if we found it ---

if table:
    print("Success! Found the S&P 500 constituents table.")
else:
    print("Error: Could not find the table. The webpage structure might have changed.")

Success! Found the S&P 500 constituents table.
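Once a table has been found, its rows and cells can be walked by hand. The sketch below does this on a small, made-up HTML snippet rather than the live page; `SAMPLE_HTML`, `sample_soup` and `sample_table` are illustrative names that are not part of the notebook.

```python
from bs4 import BeautifulSoup

# A made-up snippet that mimics the shape of the real table (illustrative only).
SAMPLE_HTML = """
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_table = sample_soup.find("table", id="constituents")

# find() returns None when nothing matches, which is why the check above matters.
missing_table = sample_soup.find("table", id="does-not-exist")

# Each <tr> is a row; each <th> or <td> inside it is a cell.
rows = []
for tr in sample_table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, *records = rows
print(header)
print(records)
```

Walking the rows manually is mostly useful when a table needs special handling; for a regular table, converting it straight to a DataFrame is less work.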


In [14]:
from io import StringIO

import pandas as pd

# 1 - Convert the HTML table to a DataFrame

# pd.read_html() parses every <table> in the HTML it is given
# and returns a list of DataFrames.
# We pass only our *specific* table, wrapped in StringIO because newer
# pandas versions deprecate passing a literal HTML string.
# The list has a single entry, so we take the first one [0].

df_list = pd.read_html(StringIO(str(table)))
df_sp500_df = df_list[0]

# 2 - Check the new DataFrame

print("Successfully converted the table to a DataFrame.")
print(df_sp500_df.head())

Successfully converted the table to a DataFrame.
  Symbol             Security             GICS Sector  \
0    MMM                   3M             Industrials   
1    AOS          A. O. Smith             Industrials   
2    ABT  Abbott Laboratories             Health Care   
3   ABBV               AbbVie             Health Care   
4    ACN            Accenture  Information Technology   

                GICS Sub-Industry    Headquarters Location  Date added  \
0        Industrial Conglomerates    Saint Paul, Minnesota  1957-03-04   
1               Building Products     Milwaukee, Wisconsin  2017-07-26   
2           Health Care Equipment  North Chicago, Illinois  1957-03-04   
3                   Biotechnology  North Chicago, Illinois  2012-12-31   
4  IT Consulting & Other Services          Dublin, Ireland  2011-07-06   

       CIK      Founded  
0    66740         1902  
1    91142         1916  
2     1800         1888  
3  1551152  2013 (1888)  
4  1467373         1989  




In [18]:
# 3 - Clean the data
# Keep only the columns the boss asked for:
# ticker ('Symbol'), company name ('Security') and industry ('GICS Sector')

final_df = df_sp500_df[['Symbol', 'Security', 'GICS Sector']]

print("\n-- Final, Cleaned list --")
print("Successfully cleaned the data.")
print(final_df.head())


-- Final, Cleaned list --
Successfully cleaned the data.
  Symbol             Security             GICS Sector
0    MMM                   3M             Industrials
1    AOS          A. O. Smith             Industrials
2    ABT  Abbott Laboratories             Health Care
3   ABBV               AbbVie             Health Care
4    ACN            Accenture  Information Technology
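With the three columns isolated, a natural next step is to summarise the list and save it. The sketch below is an illustration under assumptions: `sample_df` is a hand-built stand-in that holds only the five rows printed above, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Stand-in for final_df, limited to the five rows shown in the output above.
sample_df = pd.DataFrame({
    "Symbol": ["MMM", "AOS", "ABT", "ABBV", "ACN"],
    "Security": ["3M", "A. O. Smith", "Abbott Laboratories", "AbbVie", "Accenture"],
    "GICS Sector": ["Industrials", "Industrials", "Health Care",
                    "Health Care", "Information Technology"],
})

# How many companies fall into each sector?
sector_counts = sample_df["GICS Sector"].value_counts()
print(sector_counts)

# Write the list as CSV; an in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

On the real data, the same two calls would run on `final_df`, with a file path passed to `to_csv` in place of the buffer.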


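You have now run the entire web scraping workflow:

1 - Downloaded the page (sending a browser-like User-Agent to avoid 403 errors).

2 - Parsed the messy HTML (BeautifulSoup).

3 - Located the exact data you needed (soup.find).

4 - Extracted the table into a DataFrame (pd.read_html) and kept only the columns you needed.

You now have a DataFrame listing every S&P 500 constituent with its ticker and sector. A few companies list more than one share class, so the row count can be slightly above 500.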