Use GitHub Classroom to create a repo for this assignment. Create an IPython notebook in the repo and commit it. I want you to keep committing it as you edit it so I can see it build up. Your mission is to scrape the CUNY Fall 2021 academic calendar site using the requests library, Beautiful Soup, and pandas. You should wind up with a pandas DataFrame whose index is a Python date. There should be a column for “day of the week” with the variable label dow, and a column called text with the explanation.

In [81]:
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup

In [None]:
"""
https://www.ccny.cuny.edu/registrar/fall

Structure of the calendar site:
<tbody> - contains the table
<tr> - table row
    <td><p> contents </p></td>  - Date
    <td>                        - Weekday
    <td>                        - Description

<p> contents may contain <strong></strong>
The table has no class or id attributes, so cells must be accessed by index

"""

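The structure notes above can be tried out on a small stand-in for the page. This is only a sketch: `SAMPLE_HTML` and the helper `row_cells` are made up for illustration, and the real calendar markup may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the calendar markup; the real page has many more rows.
SAMPLE_HTML = """
<table><tbody>
<tr>
  <td><p><strong>August 25</strong></p></td>
  <td><p>Wednesday</p></td>
  <td><p>Start of Fall Term</p></td>
</tr>
</tbody></table>
"""


def row_cells(tr):
    """Return the stripped text of each <td> in a row, in column order."""
    return [td.get_text(strip=True) for td in tr.find_all('td')]


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
first_row = sample_soup.find('tbody').find('tr')

# With no class or id to select on, the columns are unpacked by position.
date_text, weekday_text, desc_text = row_cells(first_row)
print(date_text, weekday_text, desc_text, sep=' | ')
```

Because `get_text` flattens nested tags, the optional `<strong>` inside `<p>` needs no special handling.
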
In [8]:
url = 'https://www.ccny.cuny.edu/registrar/fall'
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly

In [None]:
"""
The index is supposed to be a Python date, so the date strings need to be converted.
Unfortunately, the dates appear as single dates, as ranges, and in some cases with a year at the end.
For a range, let's add two rows instead of one, with the text description marked as START or END.
"""

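The date handling described above can be isolated in a small helper, which makes it easy to check the three cases separately. This is a sketch, not the notebook's actual code: `parse_date_cell` is a hypothetical name, the input formats ('August 25', 'August 25 - 31', 'January 1, 2022') are assumed from the description, and a range is assumed to stay within one month.

```python
from datetime import datetime


def parse_date_cell(cell_text, default_year=2021):
    """Parse a date cell into a list of datetimes.

    Returns one datetime for a single date and two (start, end) for a range.
    Assumes a range never crosses a month boundary.
    """
    month, rest = cell_text.strip().split(' ', 1)
    year = default_year
    if ',' in rest:
        # A trailing ', 2022' overrides the default year.
        rest, year_text = rest.split(',', 1)
        year = int(year_text)
    days = [part.strip() for part in rest.split('-')]
    return [datetime.strptime(f'{month} {day} {year}', '%B %d %Y') for day in days]


print(parse_date_cell('August 25'))
print(parse_date_cell('August 25 - 31'))
print(parse_date_cell('January 1, 2022'))
```

Keeping spaces between the month, day, and year in the `strptime` format avoids relying on the parser to guess where the day ends and the year begins.
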
In [111]:
table = soup.find('tbody')
records = []
for row in table.find_all('tr'):
    items = row.find_all('td')

    # Columns are only identifiable by position: date, weekday, description
    weekday = items[1].text.strip()
    desc_s = items[2].text.strip().replace('\n', ' ').replace('\t', '')

    # The date cell is 'Month day', 'Month day - day', or 'Month day, year'
    date_string = items[0].text.strip()
    month, day_s = date_string.split(' ', 1)
    year = '2021'  # default when the cell does not name a year
    if ',' in day_s:
        day_s, year = day_s.split(', ')

    row_e = None

    if '-' in day_s:
        # A range becomes two rows (START and END); both keep the range's weekday text
        day_s, day_e = day_s.split(' - ')
        datetime_e = datetime.strptime(f'{month} {day_e} {year}', '%B %d %Y')
        desc_e = desc_s + ' (END)'
        desc_s = desc_s + ' (START)'

        row_e = [datetime_e, weekday, desc_e]

    datetime_s = datetime.strptime(f'{month} {day_s} {year}', '%B %d %Y')
    row_s = [datetime_s, weekday, desc_s]

    records.append(row_s)
    if row_e:
        records.append(row_e)

df = pd.DataFrame(records, columns=['date', 'day of the week', 'text'])


In [112]:
print(df.head())

        date      day of the week  \
0 2021-08-01               Sunday   
1 2021-08-18            Wednesday   
2 2021-08-24              Tuesday   
3 2021-08-25            Wednesday   
4 2021-08-25  Wednesday - Tuesday   

                                                text  
0  Application for degree for January and Februar...  
1                 Last day to apply for Study Abroad  
2  Last day of Registration; Last day to file ePe...  
3  Start of Fall Term; Classes begin; Initial Reg...  
4  Change of program period; late fees apply (START)  


In [113]:
df = df.set_index('date')

In [114]:
df.index

DatetimeIndex(['2021-08-01', '2021-08-18', '2021-08-24', '2021-08-25',
               '2021-08-25', '2021-08-31', '2021-08-26', '2021-08-28',
               '2021-08-31', '2021-09-01', '2021-09-03', '2021-09-08',
               '2021-09-06', '2021-09-09', '2021-09-14', '2021-09-15',
               '2021-09-15', '2021-09-16', '2021-09-23', '2021-09-24',
               '2021-10-01', '2021-10-08', '2021-10-11', '2021-11-01',
               '2021-11-02', '2021-11-04', '2021-11-06', '2021-11-23',
               '2021-11-25', '2021-11-28', '2021-12-11', '2021-12-13',
               '2021-12-14', '2021-12-15', '2021-12-21', '2021-12-21',
               '2021-12-24', '2021-12-25', '2021-12-27', '2021-12-28',
               '2021-12-31', '2022-01-01'],
              dtype='datetime64[ns]', name='date', freq=None)
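The index above is not in chronological order (2021-08-31 comes before 2021-08-26), and the rows that came from a range still carry weekday text such as 'Wednesday - Tuesday'. One possible cleanup, sketched on a made-up three-row frame (`demo` is hypothetical, not the scraped data), is to recompute the weekday from the date itself into the `dow` column the assignment asks for, then sort the index.

```python
import pandas as pd

# Hypothetical three-row frame in the same shape as the scraped one.
demo = pd.DataFrame(
    {
        'day of the week': ['Wednesday - Tuesday', 'Wednesday - Tuesday', 'Thursday'],
        'text': [
            'Change of program period (START)',
            'Change of program period (END)',
            'Last day for Independent Study',
        ],
    },
    index=pd.DatetimeIndex(['2021-08-25', '2021-08-31', '2021-08-26'], name='date'),
)

# Derive the weekday from each date rather than trusting the scraped cell.
demo['dow'] = demo.index.day_name()

# Drop the scraped weekday column and put the rows in chronological order.
demo = demo.drop(columns='day of the week').sort_index()
print(demo)
```

Deriving `dow` from the index keeps the weekday consistent with the date for both the START and END rows of a range.
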

In [115]:
df

date        day of the week      text
2021-08-01  Sunday               Application for degree for January and Februar...
2021-08-18  Wednesday            Last day to apply for Study Abroad
2021-08-24  Tuesday              Last day of Registration; Last day to file ePe...
2021-08-25  Wednesday            Start of Fall Term; Classes begin; Initial Reg...
2021-08-25  Wednesday - Tuesday  Change of program period; late fees apply (START)
2021-08-31  Wednesday - Tuesday  Change of program period; late fees apply (END)
2021-08-26  Thursday             Last day for Independent Study
2021-08-28  Saturday             First day of Saturday Classes
2021-08-31  Tuesday              Last day to add a class to an existing enrollm...
2021-09-01  Wednesday            Verification of Enrollment rosters available t...
