# Community full board scrapper 

All community boards have separate sites for their minutes 

> I watched Soma's youtube tutorial https://www.youtube.com/watch?v=QNKxzkNpsko


I chatted with Soma, and he said he imagined each community board to have one scrapper, and it's better to have a minimum viable product since it looks more impressive on your CV than attempting and learning something even if it fails. 

## Setup: Import what you'll need to scrape the page

We'll be using either Playwrightfor this, *not* requests.

## Starting your search

Starting from [here](https://www.nyc.gov/site/bronxcb1/calendar/board-meeting-minutes.page), search for all community board meeting minutes from Community Board 1 Bronx

# Finding all minutes

In [24]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the webpage to scrape
url = "https://www.nyc.gov/site/bronxcb1/calendar/board-meeting-minutes.page"

# Fetch the HTML content of the page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    # Parse the HTML content with BeautifulSoup
    soup_doc = BeautifulSoup(html_content, "html.parser")
    # Find all divs with class "span6 about-description"
    divs = soup_doc.find_all("div", class_="span6 about-description")
    
    # Lists to store the extracted data
    dates = []
    urls = []

    # Process each div to extract dates and URLs
    for div in divs:
        # Extract all links within the div
        links = div.find_all("a", href=True)
        for link in links:
            # Append the full URL (combine with base URL if relative) and the corresponding text
            url = f"https://www.nyc.gov{link['href']}"  # Assuming the href is a relative path
            text = link.text.strip()
            urls.append(url)
            dates.append(text)
    
    # Create a DataFrame
    df = pd.DataFrame({
        "Date": dates,
        "URL": urls
    })

    # Save the DataFrame to a CSV file
    output_file = "Bronx_CB1.csv"
    df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")

else:
    print(f"Failed to fetch the webpage. Status code: {response.status_code}")


Data saved to Bronx_CB1.csv


In [14]:
import pandas as pd
df.head()


Unnamed: 0,Date,URL
0,"June 20, 2024 Minutes",https://www.nyc.gov/assets/bronxcb1/pdf/minute...
1,"May 30, 2024 Minutes",https://www.nyc.gov/assets/bronxcb1/pdf/minute...
2,"April 30, 2024 Minutes",https://www.nyc.gov/assets/bronxcb1/pdf/minute...
3,"March 28, 2024 Minutes",https://www.nyc.gov/assets/bronxcb1/pdf/minute...
4,"February 29, 2024 Minutes",https://www.nyc.gov/assets/bronxcb1/pdf/minute...


# Export the df 

I stole this code from the internet and fixed it up, link [here](https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/)

In [16]:
!pip install pdf2image

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


In [18]:
!brew install poppler

[34m==>[0m [1mAuto-updating Homebrew...[0m
Adjust how often this is run with HOMEBREW_AUTO_UPDATE_SECS or disable with
HOMEBREW_NO_AUTO_UPDATE. Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
[34m==>[0m [1mAuto-updated Homebrew![0m
Updated 3 taps (homebrew/services, homebrew/core and homebrew/cask).
[34m==>[0m [1mNew Formulae[0m
azure-core-cpp             poselib                    weaviate
cobra-cli                  protoc-gen-grpc-java
libunicode                 templ
[34m==>[0m [1mNew Casks[0m
font-ibm-plex-sans-sc                    steinberg-mediabay
steinberg-download-assistant             wizcli
steinberg-library-manager

You have [1m10[0m outdated formulae installed.

[34m==>[0m [1mDownloading https://ghcr.io/v2/homebrew/core/poppler/manifests/24.11.0[0m
######################################################################### 100.0%
[32m==>[0m [1mFetching dependencies for poppler: [32mlibgpg-error[39m, [32mlibassuan[39m, [32mlibgcryp

In [25]:
import os
import requests
import pandas as pd
import pdfplumber
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from pdf2image import convert_from_path
from google.cloud import vision
from PIL import Image
import pytesseract

# Initialize DocTR OCR Predictor
ocr_model = ocr_predictor(pretrained=True)

# Google Vision OCR Function
def extract_text_with_google_vision(pdf_path):
    client = vision.ImageAnnotatorClient()
    text = ""
    try:
        # Convert PDF pages to images and send to Google Vision API
        pages = convert_from_path(pdf_path)
        for page in pages:
            # Convert the image to bytes
            image_bytes = page.tobytes("raw", "RGB")
            image = vision.Image(content=image_bytes)
            response = client.text_detection(image=image)
            text += response.full_text_annotation.text + "\n"
    except Exception as e:
        print(f"Google Vision OCR failed: {e}")
        text = "Error extracting content"
    return text.strip()

# Tesseract OCR Function
def extract_text_with_tesseract(pdf_path):
    text = ""
    try:
        pages = convert_from_path(pdf_path)
        for page in pages:
            text += pytesseract.image_to_string(page) + "\n"
    except Exception as e:
        print(f"Tesseract OCR failed: {e}")
        text = "Error extracting content"
    return text.strip()

# Load the DataFrame
input_file = "Bronx_CB1.csv"
df = pd.read_csv(input_file)

# Folder to store downloaded PDFs
folder_name = "Bronx_CB1_PDFs"
os.makedirs(folder_name, exist_ok=True)

# Initialize a list to store content
content_list = []
failed_files = []

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    url = row['URL']
    date = row['Date']

    try:
        print(f"Processing: {date} ({url})")
        file_name = os.path.basename(url)
        file_path = os.path.join(folder_name, file_name)

        if not os.path.exists(file_path):
            response = requests.get(url)
            with open(file_path, 'wb') as pdf_file:
                pdf_file.write(response.content)
            print(f"Downloaded: {file_path}")

        pdf_text = ""
        # Step 1: Attempt to extract with pdfplumber
        try:
            with pdfplumber.open(file_path) as pdf:
                for page in pdf.pages:
                    pdf_text += page.extract_text() or ""
        except Exception as e:
            print(f"pdfplumber failed for {file_name}: {e}")

        # Step 2: Fallback to DocTR if pdfplumber fails
        if not pdf_text.strip():
            print(f"Using DocTR for OCR on {file_name}")
            try:
                pdf_doc = DocumentFile.from_pdf(file_path)
                ocr_result = ocr_model(pdf_doc)
                pdf_text = "\n".join(page["content"] for page in ocr_result.export()["pages"])
            except Exception as e:
                print(f"DocTR failed for {file_name}: {e}")

        # Step 3: Fallback to Tesseract OCR if DocTR fails
        if not pdf_text.strip():
            print(f"Using Tesseract OCR for {file_name}")
            pdf_text = extract_text_with_tesseract(file_path)

        # Step 4: Final fallback to Google Vision OCR if Tesseract fails
        if not pdf_text.strip():
            print(f"Using Google Vision OCR for {file_name}")
            pdf_text = extract_text_with_google_vision(file_path)

        # Log files that could not be processed
        if not pdf_text.strip():
            failed_files.append(file_name)
            pdf_text = "Manual review required"

        content_list.append(pdf_text.strip())

    except Exception as e:
        print(f"Error processing {date}: {e}")
        content_list.append("Error extracting content")

# Add the content as a new column in the DataFrame
df['Content'] = content_list
df.to_csv("Bronx_CB1_with_content.csv", index=False)

# Save failed files for manual review
with open("failed_files.log", "w") as log_file:
    log_file.write("\n".join(Bronx_CB1_failed_files))

print("Processing complete. Check 'Bronx_CB1_with_content.csv' and 'failed_files.log'.")


  state_dict = torch.load(archive_path, map_location="cpu")


Processing: June 20, 2024 Minutes (https://www.nyc.gov/assets/bronxcb1/pdf/minutes/June-24-Official-Minutes-General-Board.pdf)
Using DocTR for OCR on June-24-Official-Minutes-General-Board.pdf
DocTR failed for June-24-Official-Minutes-General-Board.pdf: 'content'
Using Tesseract OCR for June-24-Official-Minutes-General-Board.pdf
Processing: May 30, 2024 Minutes (https://www.nyc.gov/assets/bronxcb1/pdf/minutes/May-24-Official-Minutes-General-Board.pdf)
Using DocTR for OCR on May-24-Official-Minutes-General-Board.pdf
DocTR failed for May-24-Official-Minutes-General-Board.pdf: 'content'
Using Tesseract OCR for May-24-Official-Minutes-General-Board.pdf
Processing: April 30, 2024 Minutes (https://www.nyc.gov/assets/bronxcb1/pdf/minutes/April-24-Official-Minutes-General-Board.pdf)
Using DocTR for OCR on April-24-Official-Minutes-General-Board.pdf
DocTR failed for April-24-Official-Minutes-General-Board.pdf: 'content'
Using Tesseract OCR for April-24-Official-Minutes-General-Board.pdf
Process

NameError: name 'Bronx_CB1_failed_files' is not defined

# Print PDF 

I used this link from Soma's website [here](https://jonathansoma.com/everything/pdfs/ocr-tools/) 