### In-Class Assignment: Web Scraping and Data Extraction from a New Webpage
Use the requests library to fetch a new webpage.
Parse the HTML content using BeautifulSoup.
Extract various elements such as figures, tables, and text.
Work collaboratively in groups to practice web scraping and present your findings.
- Task 1: Select a webpage of interest (e.g., a news article, an educational resource, or a data-driven website). Ensure that the selected webpage contains a variety of elements, such as tables, figures, and text content.
- Task 2: Fetch and Parse the Webpage

In [1]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_college_athletic_programs_in_Kentucky'
# A timeout keeps the cell from hanging if the server does not respond
response = requests.get(url, timeout=10)

In [2]:

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print(f"Failed to fetch the webpage (status code {response.status_code}).")

Successfully fetched the webpage!


In [3]:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

### Task 3: Extract Elements

In [8]:
# Find all images and extract their src attributes.
images = soup.find_all('img')
image_urls = [img['src'] for img in images if 'src' in img.attrs]
print(image_urls)

['/static/images/icons/wikipedia.png', '/static/images/mobile/copyright/wikipedia-wordmark-en.svg', '/static/images/mobile/copyright/wikipedia-tagline-en.svg', '//upload.wikimedia.org/wikipedia/commons/thumb/8/84/USA_Kentucky_location_map.svg/400px-USA_Kentucky_location_map.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Green_pog.svg/8px-Green_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Green_pog.svg/8px-Green_pog.svg.png', '//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png', '//up

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_college_athletic_programs_in_Kentucky"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')
for img in images:
    img_url = img.get('src')
    print(img_url)


/static/images/icons/wikipedia.png
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
/static/images/mobile/copyright/wikipedia-tagline-en.svg
//upload.wikimedia.org/wikipedia/commons/thumb/8/84/USA_Kentucky_location_map.svg/400px-USA_Kentucky_location_map.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Green_pog.svg/8px-Green_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Green_pog.svg/8px-Green_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png
//upload.wikimedia.org/wikipedia/commons/t
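
The `src` values printed above are not full URLs: some are site-relative (`/static/...`) and some are protocol-relative (`//upload.wikimedia.org/...`), so they cannot be passed to `requests.get` as-is. A minimal sketch of one way to resolve them, using `urljoin` from the standard library, is shown here. The helper name `to_absolute` and the short `raw_srcs` sample list are illustrative assumptions and not part of the original notebook.

```python
from urllib.parse import urljoin

# Page the src values were scraped from (same URL as in the cells above).
page_url = "https://en.wikipedia.org/wiki/List_of_college_athletic_programs_in_Kentucky"

# A small sample of src values copied from the output above.
raw_srcs = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/8/84/USA_Kentucky_location_map.svg/400px-USA_Kentucky_location_map.svg.png",
    None,  # img.get('src') returns None when an <img> tag has no src attribute
]


def to_absolute(src, base=page_url):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(base, src) if src else None


# Keep only the entries that resolved to a URL.
absolute_urls = [u for u in (to_absolute(s) for s in raw_srcs) if u]
for u in absolute_urls:
    print(u)
```

In the notebook itself, the same helper could be applied to every `img.get('src')` value instead of the sample list.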

In [17]:
# Run to see image

import requests
from PIL import Image
from io import BytesIO

# Corrected URL of the image
img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/USA_Kentucky_location_map.svg/400px-USA_Kentucky_location_map.svg.png"

# Fetch the image
response = requests.get(img_url)

# Check that the response is an image; .get avoids a KeyError if the header is missing
if 'image' in response.headers.get('Content-Type', ''):
    img = Image.open(BytesIO(response.content))
    # Display the image
    img.show()
else:
    print("The URL does not point to an image.")


In [5]:
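# Locate and extract all tables on the webpage, converting them into Pandas DataFrames.
from io import StringIO

import pandas as pd

tables = soup.find_all('table')
for i, table in enumerate(tables):
    # Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas
    df = pd.read_html(StringIO(str(table)))[0]
    print(f"Table {i+1}:\n", df.head(), "\n")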

Table 1:
                         Team                       School        City  \
                        Team                       School        City   
                        Team                       School        City   
0         Bellarmine Knights        Bellarmine University  Louisville   
1  Eastern Kentucky Colonels  Eastern Kentucky University    Richmond   
2          Kentucky Wildcats       University of Kentucky   Lexington   
3       Louisville Cardinals     University of Louisville  Louisville   
4      Morehead State Eagles    Morehead State University    Morehead   

  Conference Sport sponsorship                                                  
  Conference        Foot- ball Basketball     Base- ball Soft- ball Soccer      
  Conference        Foot- ball          M   W Base- ball Soft- ball      M   W  
0       ASUN               [a]        NaN NaN        NaN        NaN    NaN NaN  
1       ASUN           FCS [b]        NaN NaN        NaN        NaN    NaN NaN  


In [10]:
# Take the first table on the page, if any were found
table = tables[0] if tables else None
if table is not None:
    print("Table found")
else:
    print("No table found")

Table found


In [11]:
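from io import StringIO

# Convert the first table to a DataFrame (StringIO avoids the literal-HTML deprecation warning)
df = pd.read_html(StringIO(str(table)))[0]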
df.head()

Unnamed: 0_level_0,Team,School,City,Conference,Sport sponsorship,Sport sponsorship,Sport sponsorship,Sport sponsorship,Sport sponsorship,Sport sponsorship,Sport sponsorship
Unnamed: 0_level_1,Team,School,City,Conference,Foot- ball,Basketball,Basketball,Base- ball,Soft- ball,Soccer,Soccer
Unnamed: 0_level_2,Team,School,City,Conference,Foot- ball,M,W,Base- ball,Soft- ball,M,W
0,Bellarmine Knights,Bellarmine University,Louisville,ASUN,[a],,,,,,
1,Eastern Kentucky Colonels,Eastern Kentucky University,Richmond,ASUN,FCS [b],,,,,,
2,Kentucky Wildcats,University of Kentucky,Lexington,SEC,FBS,,,,,[c],
3,Louisville Cardinals,University of Louisville,Louisville,ACC,FBS,,,,,,
4,Morehead State Eagles,Morehead State University,Morehead,OVC,FCS [d],,,,,,
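
The table above has a three-level header, so each column label is a tuple such as `('Sport sponsorship', 'Basketball', 'M')`, and simple columns repeat the same label at every level. One possible way to make the columns easier to work with is to flatten each tuple into a single string. The sketch below is an illustration only: the function name `flatten_header`, the `' / '` separator, and the rule that removes `'- '` from labels like `'Foot- ball'` are assumptions, and the `header` list is a hand-copied subset of the columns shown above.

```python
def flatten_header(columns):
    """Collapse multi-level column labels into single strings.

    A label repeated across levels (e.g. 'Team', 'Team', 'Team') is kept once;
    distinct levels are joined with ' / '.
    """
    flat = []
    for levels in columns:
        parts = []
        for label in levels:
            # Assumption: '- ' only appears as a line-break hyphen, as in 'Foot- ball'
            label = str(label).replace("- ", "")
            if label not in parts:
                parts.append(label)
        flat.append(" / ".join(parts))
    return flat


# Hand-copied subset of the header tuples from the table above.
header = [
    ("Team", "Team", "Team"),
    ("Conference", "Conference", "Conference"),
    ("Sport sponsorship", "Foot- ball", "Foot- ball"),
    ("Sport sponsorship", "Basketball", "M"),
    ("Sport sponsorship", "Basketball", "W"),
]

flat_header = flatten_header(header)
print(flat_header)
```

Because iterating over a pandas `MultiIndex` yields tuples, the same function could be applied in the notebook as `df.columns = flatten_header(df.columns)`.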


In [6]:
# Extract the main text content, such as paragraphs or headings.
paragraphs = soup.find_all('p')
text_content = ' '.join([para.get_text() for para in paragraphs])
print(text_content[:500])  # Print the first 500 characters


This is a list of college athletics programs in the U.S. state of Kentucky.
 



### Task 4: Analyze and Discuss Findings
Each group will analyze the extracted data and discuss the following:
- What figures (images) were extracted and what do they represent?
- What information is contained in the tables, and how does it contribute to the overall content of the webpage?
- What is the main focus of the text content extracted? How does it relate to the images and tables?
- Discuss the challenges faced during extraction, such as dealing with complex HTML structures or incomplete data.
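
For the first discussion question, it may help to separate real figures from small repeated icons in the extracted image list. The sketch below counts duplicate URLs and reads the pixel width that thumbnail file names of the form `<N>px-...` appear to encode. The `image_urls` sample is copied from the output earlier in the notebook; the helper `thumb_width` and the `MIN_WIDTH` cutoff of 100 pixels are arbitrary assumptions for illustration.

```python
import re
from collections import Counter

# Sample of src values copied from the image extraction output.
image_urls = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/8/84/USA_Kentucky_location_map.svg/400px-USA_Kentucky_location_map.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Green_pog.svg/8px-Green_pog.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Blue_pog.svg/8px-Blue_pog.svg.png",
]


def thumb_width(url):
    """Return the width encoded in a '<N>px-' thumbnail file name, or None."""
    match = re.search(r"/(\d+)px-[^/]*$", url)
    return int(match.group(1)) if match else None


# How often each distinct URL appears on the page.
counts = Counter(image_urls)

# Keep only thumbnails at least MIN_WIDTH pixels wide (hypothetical cutoff).
MIN_WIDTH = 100
figures = [u for u in counts if (thumb_width(u) or 0) >= MIN_WIDTH]

for url, n in counts.items():
    print(n, thumb_width(url), url)
print("Likely figures:", figures)
```

With this sample, only the Kentucky location map passes the width filter, while the 8-pixel marker icons and the site logo are set aside.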

### Task 5: Present Findings
Share your analysis of the extracted elements.
Discuss any patterns, relationships, or insights gained from the data.

Each group should submit their Jupyter notebook (or Python script) with the code, analysis, and any additional notes or reflections on the exercise.

The extracted image is a map of the state of Kentucky.
The table lists the D1 schools in the state of Kentucky and their characteristics, such as city, conference, and sponsored sports. The table is discussed further throughout the webpage.
The main focus of the text is the list of D1 schools with their cities and conferences. The image relates to it because it is a map of the state of Kentucky.
A challenge we faced was that we could not extract the exact image from Wikipedia because it has so many layers and links.