## Scrape CCNY 2021 Fall Calendar

* Use:
    * requests library
    * BeautifulSoup
    * Pandas

* End up with Pandas data frame:
    * index column is a Python date
    * "D.O.W." (day of the week) column
    * Text column with an explanation.

In [None]:
!pip install pandas
!pip install beautifulsoup4
!pip install requests

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import re

# Added the imports to start the project, will add more as needed.

1. Let's set the URL and make the request.

In [2]:
url = 'https://www.ccny.cuny.edu/registrar/fall'

response = requests.get(url)
html_document = response.text
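
The request above assumes the server answers with a successful status. A minimal sketch of a more defensive fetch follows; `fetch_html` is a hypothetical helper name and the 10-second timeout is an assumed value, neither comes from the original notebook:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails."""
    # timeout (seconds) is an assumed value; without one, requests may wait indefinitely
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the cell could read `html_document = fetch_html(url)`.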


2. Parse the HTML

In [3]:
soup = BeautifulSoup(html_document, 'html.parser')

table = soup.find('table') # since there was only 1 table on the site
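
`soup.find('table')` returns `None` when the page has no table, and a later `table.find('tbody')` would then fail with an `AttributeError`. A small sketch of a guard is shown here; the `first_table` helper and the inline HTML are illustrative stand-ins, not the real registrar page:

```python
from bs4 import BeautifulSoup


def first_table(html):
    """Return the first <table> element, or raise if the page has none."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError('no <table> found; the page layout may have changed')
    return table


# Stand-in HTML with placeholder text, not the actual calendar page
sample_html = """
<table><tbody>
<tr><td><strong>August 25</strong></td><td>Wednesday</td><td><p>Sample event text</p></td></tr>
</tbody></table>
"""
sample_table = first_table(sample_html)
cells = [td.get_text(strip=True) for td in sample_table.find_all('td')]
```

Failing early with a clear message makes it easier to notice when the site changes its markup.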

3. Now we parse the table

In [4]:
data = []

for row in table.find('tbody').find_all('tr'):
    columns = row.find_all('td')
    
    if len(columns) == 3:
        # Extract date, day of the week, and text information
        date_str = (columns[0].find('strong').get_text(strip=True) if columns[0].find('strong') else
                   columns[0].find('p').get_text(strip=True) if columns[0].find('p') else
                   columns[0].get_text(strip=True))
        
        dow = (columns[1].find('strong').get_text(strip=True) if columns[1].find('strong') else
               columns[1].find('p').get_text(strip=True) if columns[1].find('p') else
               columns[1].get_text(strip=True))
        
        text = (''.join([line.get_text(strip=True) for line in columns[2].find_all(['p', 'br'])])
                if columns[2].find(['p', 'br']) else columns[2].get_text(strip=True))
        
        # Handle date ranges and different formats
        try:
            if '-' in date_str:
                # Handle date ranges, e.g., 'August 25 - 31'
                start_date_str, end_date_str = re.split(r'\s*-\s*', date_str)
                start_date = datetime.strptime(start_date_str, '%B %d').replace(year=2021)
                
                end_date = (datetime.strptime(end_date_str, '%d').replace(month=start_date.month, year=2021)
                           if len(end_date_str) <= 2 else
                           datetime.strptime(end_date_str, '%B %d').replace(year=2021))
                
                current_date = start_date
                while current_date <= end_date:
                    data.append([current_date, dow, text])
                    current_date += timedelta(days=1)
            else:
                # Handle single dates
                date = (datetime.strptime(date_str, '%B %d, %Y') if ',' in date_str
                       else datetime.strptime(date_str, '%B %d').replace(year=2021))
                data.append([date, dow, text])
                
        except ValueError as e:
            print(f"Skipping invalid date: {date_str}, error: {e}")
            continue

DeprecationWarning: Parsing dates involving a day of month without a year specified is ambiguous
and fails to parse leap day. The default behavior will change in Python 3.15
to either always raise an exception or to use a different default year (TBD).
To avoid trouble, add a specific year to the input & format.
See https://github.com/python/cpython/issues/70647.
  else datetime.strptime(date_str, '%B %d').replace(year=2021))
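
The warning above comes from calling `strptime` with `'%B %d'` and no year. One way to avoid it, sketched here under the assumption that every entry belongs to 2021 and uses either the `'August 25'` or the `'August 25 - 31'` form, is to append the year before parsing. The helper names `parse_month_day` and `expand_dates` are hypothetical. The sketch also derives the weekday from each date instead of copying the range label (such as `Wednesday - Tuesday`) onto every row:

```python
import re
from datetime import datetime, timedelta

YEAR = 2021  # assumed calendar year, matching the hard-coded value in the loop


def parse_month_day(text, year=YEAR):
    """Parse 'August 25' with an explicit year so strptime never guesses one."""
    return datetime.strptime(f'{text.strip()} {year}', '%B %d %Y')


def expand_dates(date_str, year=YEAR):
    """Return every date covered by 'August 25' or 'August 25 - 31'."""
    parts = re.split(r'\s*-\s*', date_str.strip())
    start = parse_month_day(parts[0], year)
    if len(parts) == 1:
        return [start]
    end_str = parts[1]
    if end_str.isdigit():
        # Day only: the range ends in the same month as it starts
        end = start.replace(day=int(end_str))
    else:
        end = parse_month_day(end_str, year)
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


dates = expand_dates('August 25 - 31')
# Weekday names computed from the dates themselves (English under the default locale)
weekdays = [d.strftime('%A') for d in dates]
```

Entries that already carry a year, such as those containing a comma, would still need the separate `'%B %d, %Y'` branch used in the loop.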


4. We create the DataFrame

In [5]:
df = pd.DataFrame(data, columns=['date', 'dow', 'text'])
df.set_index('date', inplace=True)

print(df)

                             dow  \
date                               
2021-08-01                Sunday   
2021-08-18             Wednesday   
2021-08-24               Tuesday   
2021-08-25             Wednesday   
2021-08-25   Wednesday - Tuesday   
2021-08-26   Wednesday - Tuesday   
2021-08-27   Wednesday - Tuesday   
2021-08-28   Wednesday - Tuesday   
2021-08-29   Wednesday - Tuesday   
2021-08-30   Wednesday - Tuesday   
2021-08-31   Wednesday - Tuesday   
2021-08-26              Thursday   
2021-08-28              Saturday   
2021-08-31               Tuesday   
2021-09-01             Wednesday   
2021-09-03    Friday - Wednesday   
2021-09-04    Friday - Wednesday   
2021-09-05    Friday - Wednesday   
2021-09-06    Friday - Wednesday   
2021-09-07    Friday - Wednesday   
2021-09-08    Friday - Wednesday   
2021-09-06                Monday   
2021-09-09              Thursday   
2021-09-14               Tuesday   
2021-09-15             Wednesday   
2021-09-15  Wednesday - Thur