# CUNY Fall 2021 Academic Calendar Scraper

This notebook scrapes the CUNY Fall 2021 academic calendar from the CCNY registrar website and creates a pandas DataFrame with the calendar data.

**Objective**: Create a DataFrame with:
- Index: Python date objects
- Column 'dow': Day of the week
- Column 'text': Event description


## Import Required Libraries


In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re


## Scrape the Calendar Data


In [16]:
# URL of the CUNY Fall 2021 academic calendar
url = "https://www.ccny.cuny.edu/registrar/fall"

# Send GET request to the website (a timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=30)
print(f"Status Code: {response.status_code}")

# Check if request was successful
if response.status_code == 200:
    print("Successfully retrieved the webpage!")
else:
    print(f"Failed to retrieve webpage. Status code: {response.status_code}")


Status Code: 200
Successfully retrieved the webpage!


In [17]:
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the calendar table (assumes it is the first <table> element on the page)
table = soup.find('table')

if table:
    print("Found calendar table!")
    # Let's examine the table structure
    rows = table.find_all('tr')
    print(f"Number of rows found: {len(rows)}")
    
    # Print first few rows to understand structure
    for i, row in enumerate(rows[:3]):
        cells = row.find_all(['td', 'th'])
        print(f"Row {i}: {len(cells)} cells")
        for j, cell in enumerate(cells):
            print(f"  Cell {j}: {cell.get_text().strip()[:50]}...")
else:
    print("No table found - let's examine the page structure")
    print("Page title:", soup.title.get_text() if soup.title else "No title found")

Found calendar table!
Number of rows found: 37
Row 0: 3 cells
  Cell 0: DATES...
  Cell 1: DAYS...
  Cell 2: ...
Row 1: 3 cells
  Cell 0: August 01...
  Cell 1: Sunday...
  Cell 2: Application for degree for January and February 20...
Row 2: 3 cells
  Cell 0: August 18...
  Cell 1: Wednesday...
  Cell 2: Last day to apply for Study Abroad...
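
The DATES and DAYS columns are redundant once a year is assumed, which allows a cheap sanity check on scraped rows. The following is a minimal sketch, not part of the original notebook: `weekday_matches` and `DAY_NAMES` are hypothetical names, and the year 2021 is assumed for dates that carry no year. For range rows (for example, a day value like "Wednesday - Tuesday") only the first day is compared.

```python
from datetime import date

# Fixed English names so the check does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def weekday_matches(d: date, scraped_day: str) -> bool:
    """Return True if the first day name in scraped_day equals the weekday of d."""
    first_day = scraped_day.split(" - ")[0].strip()
    return DAY_NAMES[d.weekday()] == first_day

# Rows taken from the table preview above, with 2021 assumed as the year.
print(weekday_matches(date(2021, 8, 1), "Sunday"))
print(weekday_matches(date(2021, 8, 18), "Wednesday"))
```

A mismatch would suggest either a wrong assumed year or a typo on the source page, so it is worth logging rather than silently dropping the row.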


## Extract Calendar Data


In [18]:

# Extract calendar data from the table
calendar_data = []

if table:
    rows = table.find_all('tr')
    
    # Skip header row and process data rows
    for row in rows[1:]:  # Skip the header row
        cells = row.find_all('td')
        if len(cells) >= 3:  # Ensure we have at least 3 columns
            date_text = cells[0].get_text().strip()
            day_text = cells[1].get_text().strip()
            description_text = cells[2].get_text().strip()
            
            # Keep only rows where all three cells are non-empty
            if date_text and day_text and description_text:
                calendar_data.append({
                    'date_text': date_text,
                    'day_text': day_text,
                    'description': description_text
                })

print(f"Extracted {len(calendar_data)} calendar entries")

# Display first few entries to verify
for i, entry in enumerate(calendar_data[:5]):
    print(f"\nEntry {i+1}:")
    print(f"  Date: {entry['date_text']}")
    print(f"  Day: {entry['day_text']}")
    print(f"  Description: {entry['description'][:100]}...")

Extracted 36 calendar entries

Entry 1:
  Date: August 01
  Day: Sunday
  Description: Application for degree for January and February 2022 begins...

Entry 2:
  Date: August 18
  Day: Wednesday
  Description: Last day to apply for Study Abroad...

Entry 3:
  Date: August 24
  Day: Tuesday
  Description: Last day of Registration;
			Last day to file ePermit for the Fall 2021;
			Last day to drop classes...

Entry 4:
  Date: August 25
  Day: Wednesday
  Description: Start of Fall Term;
			Classes begin;
			Initial Registration Appeals begin;...

Entry 5:
  Date: August 25 - 31
  Day: Wednesday - Tuesday
  Description: Change of program period; late fees apply...
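
Entries 3 and 4 show that some descriptions carry embedded newlines and tabs from the page markup. One possible cleanup, sketched here under the assumption that a single space between items is acceptable, collapses every whitespace run with `re`. The helper name `normalize_whitespace` is hypothetical and does not appear in the notebook.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse any run of spaces, tabs, or newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

# Description text shaped like Entry 3 above.
raw = ("Last day of Registration;\n\t\t\t"
       "Last day to file ePermit for the Fall 2021;\n\t\t\t"
       "Last day to drop classes")
print(normalize_whitespace(raw))
# Last day of Registration; Last day to file ePermit for the Fall 2021; Last day to drop classes
```

Applying this to `description_text` before appending to `calendar_data` would keep the `text` column on one line per event.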


## Parse Dates and Create DataFrame


In [19]:

def parse_date(date_str):
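    """
    Parse a date string from the calendar into a Python date object.
    Expected formats: "August 01", "August 25 - 31", "September 03 - 08", etc.
    For ranges, only the first date is returned. Strings that already include
    a year (e.g. "January 1, 2022") are not supported and yield None.
    """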
    # Clean up the date string
    date_str = date_str.strip()
    
    # Handle date ranges (take the first date)
    if ' - ' in date_str:
        date_str = date_str.split(' - ')[0].strip()
    
    # Add year (2021 for Fall 2021)
    # Handle cases where month might roll over to next year
    month_name = date_str.split()[0]
    if month_name in ['January', 'February']:
        year = 2022  # These dates are in the next year
    else:
        year = 2021
    
    try:
        # Parse the date
        date_with_year = f"{date_str} {year}"
        parsed_date = datetime.strptime(date_with_year, "%B %d %Y").date()
        return parsed_date
    except ValueError as e:
        print(f"Error parsing date '{date_str}': {e}")
        return None

# Test the date parsing function with a few examples, including one
# explicit-year string ("January 1, 2022") that it does not handle
test_dates = ["August 01", "August 25 - 31", "December 31", "January 1, 2022"]
print("Testing date parsing:")
for test_date in test_dates:
    parsed = parse_date(test_date)
    print(f"'{test_date}' -> {parsed}")

Testing date parsing:
'August 01' -> 2021-08-01
'August 25 - 31' -> 2021-08-25
'December 31' -> 2021-12-31
Error parsing date 'January 1, 2022': time data 'January 1, 2022 2022' does not match format '%B %d %Y'
'January 1, 2022' -> None
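
The failure on 'January 1, 2022' happens because the function appends a year to a string that already has one. A more tolerant variant could try explicit-year formats first and also return the end of a range. This is a sketch under stated assumptions, not the notebook's method: `parse_single`, `parse_date_range`, and the `term_year` parameter are hypothetical names, and the accepted formats are inferred only from the strings shown above.

```python
from datetime import date, datetime
from typing import Optional, Tuple

def parse_single(text: str, term_year: int = 2021) -> Optional[date]:
    """Parse one date such as "August 01" or "January 1, 2022"."""
    text = text.strip()
    # First try strings that already carry a year.
    for fmt in ("%B %d, %Y", "%B %d %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    # Otherwise infer the year: January/February belong to the next year.
    month_name = text.split()[0] if text else ""
    year = term_year + 1 if month_name in ("January", "February") else term_year
    try:
        return datetime.strptime(f"{text} {year}", "%B %d %Y").date()
    except ValueError:
        return None

def parse_date_range(text: str, term_year: int = 2021) -> Tuple[Optional[date], Optional[date]]:
    """Return (start, end); for a single date, start and end are equal."""
    start_text, sep, end_text = text.partition(" - ")
    start = parse_single(start_text, term_year)
    if not sep or start is None:
        return start, start
    end_text = end_text.strip()
    if end_text.isdigit():
        # "August 25 - 31": the end reuses the start's month and year.
        try:
            end = start.replace(day=int(end_text))
        except ValueError:
            end = None
    else:
        end = parse_single(end_text, term_year)
    return start, end

for sample in ["August 01", "August 25 - 31", "January 1, 2022"]:
    print(sample, "->", parse_date_range(sample))
```

Returning both ends keeps range rows from collapsing onto their first day; whether the DataFrame should index by start date or expand each range into one row per day is a separate design choice.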


In [20]:
# Create the final DataFrame
df_data = []

print("Creating DataFrame...")
for entry in calendar_data:
    parsed_date = parse_date(entry['date_text'])
    if parsed_date:  # Only include entries with valid dates
        df_data.append({
            'date': parsed_date,
            'dow': entry['day_text'],
            'text': entry['description']
        })

# Create DataFrame with date as index
df = pd.DataFrame(df_data)
if not df.empty:
    df.set_index('date', inplace=True)
    df.sort_index(inplace=True)  # Sort by date
    
    print(f"✅ Successfully created DataFrame with {len(df)} entries")
    print("\nDataFrame Info:")
    print(df.info())
    print("\nFirst 10 entries:")
    print(df.head(10))
    print("\nLast 5 entries:")
    print(df.tail(5))
else:
    print("❌ No valid calendar data found to create DataFrame")

Creating DataFrame...
Error parsing date 'January 1, 2022': time data 'January 1, 2022 2022' does not match format '%B %d %Y'
✅ Successfully created DataFrame with 35 entries

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 2021-08-01 to 2021-12-31
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   dow     35 non-null     object
 1   text    35 non-null     object
dtypes: object(2)
memory usage: 840.0+ bytes
None

First 10 entries:
                            dow  \
date                              
2021-08-01               Sunday   
2021-08-18            Wednesday   
2021-08-24              Tuesday   
2021-08-25            Wednesday   
2021-08-25  Wednesday - Tuesday   
2021-08-26             Thursday   
2021-08-28             Saturday   
2021-08-31              Tuesday   
2021-09-01            Wednesday   
2021-09-03   Friday - Wednesday   

                                                         tex