<a href="https://colab.research.google.com/github/ezesalvatore/BasketballScraper/blob/main/BasketballScraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basketball Reference Uniform Scraper

## Project overview




### Objectives
- Extract each player's name, team, and uniform number
- Support wireframe development
- Demonstrate my web scraping expertise

### Methods used
- The scraper uses Beautiful Soup, since Basketball Reference is primarily built with HTML
- The project also respects Basketball Reference's rate limiting and crawl-delay rules
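The crawl-delay rule mentioned above can be checked in code before any request goes out. Below is a minimal sketch using the standard library's `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` text and the `check_robots` helper are assumptions made for illustration: the sample keeps the `Crawl-delay: 3` value this notebook relies on, but the `Disallow` line is hypothetical and is not copied from the real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (assumption). The real file should be read from
# https://www.basketball-reference.com/robots.txt before scraping.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""


def check_robots(robots_text, url, user_agent="*"):
    """Return (allowed, crawl_delay) for a URL, given the robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


allowed, delay = check_robots(
    SAMPLE_ROBOTS_TXT,
    "https://www.basketball-reference.com/leagues/NBA_2025_numbers.html",
)
print(f"Allowed: {allowed}, crawl delay: {delay}")
```

With the live site, `parser.set_url(...)` followed by `parser.read()` would load the real rules instead of the sample text.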

## Imports

In [None]:
# Sends HTTP requests to Basketball Reference
import requests

# Parses the HTML to pull out the name, team, and uniform number
from bs4 import BeautifulSoup

# Handles tabular data and exports the CSV file
import pandas as pd

# Adds delays between requests
import time

# Data processing and cleaning
import re

# Builds absolute URLs for web scraping
from urllib.parse import urljoin

# Suppresses warning messages (this does not hide errors)
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported")

## Session Setup

In [None]:
def setup_session():
    """
    Set up the requests session for web scraping and make sure the headers are compliant.
    """
    session = requests.Session()

    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
        'Connection': 'keep-alive',
    })

    print("Session is created!")

    return session

## Fetching Basketball Reference Data

In [None]:
def fetch_basketball_data():
    """
    Make a single request to Basketball Reference while following its robots.txt rules.

    Returns the page HTML as a string, or None if the request fails.
    """
    url = "https://www.basketball-reference.com/leagues/NBA_2025_numbers.html"

    session = setup_session()


    try:
        # Robots.txt compliance: Crawl-delay: 3
        time.sleep(3.0)

        response = session.get(url, timeout=30)

        print(f"The request to: {url} has been sent out")

        if response.status_code == 429:
            raise Exception("Rate limited - blocked for 24 hours")

        response.raise_for_status()

        print(f"Server Status: {response.status_code}")

        #HTML page returned
        return response.text

    except Exception as e:
        print(f"Request failed: {e}")
        return None
    finally:
        session.close()
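
The 429 branch above treats a rate limit as a 24-hour block. If the server also sends a `Retry-After` header (an assumption, I have not confirmed that Basketball Reference does), the wait time could be read from it. The `seconds_until_retry` helper below is hypothetical and only handles the plain-seconds form of the header; anything else falls back to the 24-hour default.

```python
def seconds_until_retry(headers, default=86400):
    """Return how long to wait after a 429, in seconds.

    `headers` is any mapping of response headers, such as response.headers.
    The default of 86400 seconds matches the 24-hour block assumed above.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        # Retry-After may be a whole number of seconds
        return max(0, int(value))
    except ValueError:
        # The HTTP-date form is not handled in this sketch
        return default


print(seconds_until_retry({"Retry-After": "120"}))
print(seconds_until_retry({}))
```

Inside `fetch_basketball_data`, this could be called as `seconds_until_retry(response.headers)` to report a more precise wait time in the error message.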

## Web Scraping

### Web Scraping Strategy

I will use the following CSS selectors to get the values from the HTML page:

**Uniform Number** : `div.data_grid_box table caption`
<br>

**Player Name**: `div.data_grid_box a[href*='/players/']`
<br>

**Team**: `span.desc a[href*='/teams/']`
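
The three selectors can be combined roughly as in the sketch below. The `SAMPLE_HTML` markup, the placeholder players and teams, and the `parse_uniforms` helper are all assumptions made for illustration. In particular, pairing each player with the team link in the same table cell is a guess about the page layout and should be checked against the real HTML.

```python
from bs4 import BeautifulSoup

# Made-up markup with placeholder players and teams (assumption),
# shaped only so that the three selectors above have something to match.
SAMPLE_HTML = """
<div class="data_grid_box">
  <table>
    <caption>0</caption>
    <tr><td>
      <a href="/players/x/example01.html">Player One</a>
      <span class="desc">(<a href="/teams/AAA/2025.html">AAA</a>)</span>
    </td></tr>
    <tr><td>
      <a href="/players/x/example02.html">Player Two</a>
      <span class="desc">(<a href="/teams/BBB/2025.html">BBB</a>)</span>
    </td></tr>
  </table>
</div>
"""


def parse_uniforms(html):
    """Return a list of {number, player, team} dicts from the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for box in soup.select("div.data_grid_box"):
        caption = box.select_one("table caption")
        if caption is None:
            continue  # skip boxes without a uniform number
        number = caption.get_text(strip=True)
        for link in box.select("a[href*='/players/']"):
            # Assumption: the team link sits in the same cell as the player link
            cell = link.find_parent("td")
            team = cell.select_one("span.desc a[href*='/teams/']") if cell else None
            rows.append({
                "number": number,
                "player": link.get_text(strip=True),
                "team": team.get_text(strip=True) if team else None,
            })
    return rows


rows = parse_uniforms(SAMPLE_HTML)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` for the CSV export.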



## Integrate with another CSV file