# NYC federal funding, fiscal year 2025

We will scrape the download links for each agency's preliminary budget report from the [New York City Council website](https://council.nyc.gov/budget/fy2025/).

In [1]:
# importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# requesting URL
url = "https://council.nyc.gov/budget/fy2025/"
response = requests.get(url)
response

<Response [403]>

In [3]:
# headers to get past the 403 error

def get_session():
    """
    This function creates a requests session that sends browser-like headers,
    because the bare request above is rejected with a 403.
    """
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # note: "br" (Brotli) responses can only be decoded if the brotli package is installed
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1"
    })
    return session

In [4]:
session = get_session()
response = session.get(url)
response

<Response [200]>
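
Displaying `response` only shows the status code, so a failed request is easy to miss once the notebook is re-run. A minimal sketch of a wrapper that fails loudly instead is below. The name `fetch_html` and the 30-second timeout are assumptions for illustration, not part of the original notebook; `session` is expected to be the object returned by `get_session()`.

```python
def fetch_html(session, url, timeout=30):
    """Hypothetical helper: return the page HTML, or raise if the request failed."""
    # timeout keeps the notebook from hanging on an unresponsive server
    response = session.get(url, timeout=timeout)
    # for a requests response, raise_for_status() raises on 4xx/5xx codes such as 403
    response.raise_for_status()
    return response.text
```

With this in place, the parsing step could start from `fetch_html(get_session(), url)` rather than from `response.text`.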

In [5]:
# soupifying HTML
soup = BeautifulSoup(response.text, "html.parser")

### Scraping PDF download links

Viewing the page source, we find that our target section starts at an `h2` tag with the id `preliminary-budget-reports`. However, no `div` or other container wraps the links we need, so we cannot select them through a parent element.

Instead, we will find the `a` tags that come *after* our target `h2` but *before* the next `h2` tag.
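
To make the idea concrete, here is a small sketch on a made-up HTML snippet. It walks forward from the target `h2` and stops at the next `h2`, which is one possible way to express "after this heading, before the next one". The HTML, the ids, and the helper name `links_between_headers` are all invented for illustration and are not taken from the council page.

```python
from bs4 import BeautifulSoup

# Made-up markup: two sections, with links that are not wrapped in a container
demo_html = """
<h2 id="first">First</h2>
<p><a href="/a.pdf">A (PDF)</a></p>
<ul>
  <li><a href="/b.PDF">B (PDF)</a></li>
  <li><a href="/page.html">Not a PDF</a></li>
</ul>
<h2 id="second">Second</h2>
<p><a href="/c.pdf">C (PDF)</a></p>
"""

def links_between_headers(soup, header_id):
    """Hypothetical helper: (text, href) pairs for PDF links in one h2 section."""
    start = soup.find("h2", id=header_id)
    if start is None:
        return []
    found = []
    # find_all_next yields matching tags in document order
    for element in start.find_all_next(["h2", "a"]):
        if element.name == "h2":
            break  # reached the next section, stop collecting
        href = element.get("href") or ""
        if href.lower().endswith(".pdf"):
            found.append((element.get_text(strip=True), href))
    return found

demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_links = links_between_headers(demo_soup, "first")
```

Here `demo_links` holds only the two PDF links from the first section; the link under the second `h2` and the non-PDF link are skipped.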

In [6]:
def extract_target_pdfs(soup, header_text, replace_text):
    """
    This function extracts PDF links that appear after an h2 tag and before the next h2 tag.

    Parameters:
    soup: BeautifulSoup object of the HTML
    header_text (dict): attributes like id or class to identify the target h2 tag
    replace_text (str): any text to remove or delete in the PDF title

    Returns:
    pdf_dict: holds all the PDF titles and download links found in the target section
    """

    target_section = soup.find("h2", attrs=header_text)
    if target_section is None:
        raise ValueError(f"No h2 tag matches {header_text}")

    # the filter may receive None for <a> tags without an href, so guard before .lower()
    def is_pdf(href):
        return bool(href) and href.lower().endswith(".pdf")

    # keep only the links whose nearest preceding h2 is the target;
    # this also covers the case where no later h2 exists
    pdf_tags = [
        a_tag for a_tag in target_section.find_all_next("a", href=is_pdf)
        if a_tag.find_previous("h2") is target_section
    ]

    pdf_dict = {
        "agency": [a_tag.text.replace(replace_text, "") for a_tag in pdf_tags],
        "link": [a_tag.get("href") for a_tag in pdf_tags]
    }
    
    return pdf_dict

In [7]:
target_links = extract_target_pdfs(soup, {"id": "preliminary-budget-reports"}, " (PDF)")
target_links

{'agency': ['February 2024 Economic and Tax Revenue Forecast ',
  'FY25 Financial Plan Overview',
  'Financial Plan Overview Dashboard',
  'Preliminary Capital Budget Report Dashboard',
  'Administration for Children’s Services',
  'Board of Correction',
  'Board of Elections',
  'City Commission on Human Rights',
  'City University of New York',
  'Civilian Complaint Review Board',
  'Department for the Aging',
  'Department of Buildings',
  'Department of City Planning',
  'Department of Citywide Administrative Services',
  'Department of Correction',
  'Department of Consumer and Worker Protection',
  'Department of Cultural Affairs',
  'Department of Design and Construction',
  'Department of Education',
  'Department of Environmental Protection',
  'Department of Finance',
  'Department of Health and Mental Hygiene',
  'Department of Homeless Services',
  'Department of Housing Preservation and Development',
  'Department of Information, Technology and Telecommunications',
  'Depa
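
The first title in the output above ends with a stray space, and nothing guarantees that every `href` on the page is an absolute URL. A possible cleanup step before building the table is sketched below. The helper name `clean_pdf_dict` and the sample paths are hypothetical; only the first sample title comes from the output above.

```python
from urllib.parse import urljoin

def clean_pdf_dict(pdf_dict, base_url):
    """Hypothetical helper: tidy whitespace in titles and make every link absolute."""
    return {
        # split/join trims the ends and collapses repeated spaces
        "agency": [" ".join(title.split()) for title in pdf_dict["agency"]],
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        "link": [urljoin(base_url, href) for href in pdf_dict["link"]],
    }

# Sample input; the link paths are placeholders, not real report URLs
sample = {
    "agency": ["February 2024 Economic and Tax Revenue Forecast ", "Board of  Correction"],
    "link": ["/budget/example.pdf", "https://example.com/already-absolute.pdf"],
}
cleaned = clean_pdf_dict(sample, "https://council.nyc.gov/budget/fy2025/")
```

In the notebook this would be applied as `clean_pdf_dict(target_links, url)`.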

### Saving the scraped data to a DataFrame, then a CSV file

In [8]:
df = pd.DataFrame(target_links)
df

| | agency | link |
|---|---|---|
| 0 | February 2024 Economic and Tax Revenue Forecast | https://council.nyc.gov/budget/wp-content/uplo... |
| 1 | FY25 Financial Plan Overview | https://council.nyc.gov/budget/wp-content/uplo... |
| 2 | Financial Plan Overview Dashboard | https://council.nyc.gov/budget/wp-content/uplo... |
| 3 | Preliminary Capital Budget Report Dashboard | https://council.nyc.gov/budget/wp-content/uplo... |
| 4 | Administration for Children’s Services | https://council.nyc.gov/budget/wp-content/uplo... |
| 5 | Board of Correction | https://council.nyc.gov/budget/wp-content/uplo... |
| 6 | Board of Elections | https://council.nyc.gov/budget/wp-content/uplo... |
| 7 | City Commission on Human Rights | https://council.nyc.gov/budget/wp-content/uplo... |
| 8 | City University of New York | https://council.nyc.gov/budget/wp-content/uplo... |
| 9 | Civilian Complaint Review Board | https://council.nyc.gov/budget/wp-content/uplo... |
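
The `...` at the end of each link is most likely pandas shortening long strings for display, in which case the stored values are still complete. A short sketch of how to check this is below, using a made-up one-row frame. The agency name and URL are placeholders, not real report data.

```python
import pandas as pd

# Hypothetical stand-in row; the URL is a placeholder, not a real report link
demo_df = pd.DataFrame({
    "agency": ["Example Agency"],
    "link": ["https://example.com/budget/wp-content/uploads/reports/example-agency.pdf"],
})

# Default display: strings longer than display.max_colwidth are cut off with "..."
short_view = repr(demo_df)

# Temporarily lift the limit to display the full stored value
with pd.option_context("display.max_colwidth", None):
    full_view = repr(demo_df)
```

Running the same `option_context` block around `df` in the notebook would show the full download links without changing any global setting.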


In [9]:
df.to_csv("preliminary-budget-reports-links.csv", encoding="UTF-8", index=False)