# Part 4: Step by Step from the Very Beginning
Follow the instructions in the Markdown cells to fill in the code cells below. We are going to build up a function to scrape an HTML file using Beautiful Soup, a few lines at a time.

## 1. Import modules, installing any missing libraries
This one is already done for you.
Beautiful Soup should already be installed. If it is not, run
`pip install beautifulsoup4` in a terminal window.

In [None]:
import re
import csv
import json
from bs4 import BeautifulSoup

## 2. Open the HTML file for scraping
Use a `with` statement to open the `Spring2018ClassSchedules.html` file and print out the first line to the screen. 

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    print(fp.readline())

## 3. Scrape out the HTML table rows
Use Beautiful Soup to select and print out the first 5 table rows, one row at a time. Take a moment to guess the structure of the HTML from the printout. What is the difference between `dddheader` and `dddefault`? What about `<tr>`, `<td>`, and `<th>`?

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    # create an HTML parse tree for the document; we can then select specific subtrees
    soup = BeautifulSoup(fp,'html.parser')
    
    # select all subtrees that represent a table row
    # the find and find_all methods are defined in the Beautiful Soup docs
    data_display_table_rows = soup.find('table',class_='datadisplaytable').find_all('tr')
    
    # print the first 5 table rows, one row at a time
    for row in data_display_table_rows[:5]:
        print(row)
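
To make the tag and class names easier to reason about, here is a minimal sketch that runs the same `find` / `find_all` calls on a tiny inline table. The HTML string is made up for illustration and only imitates the layout assumed in this section (`<th>` cells with class `dddheader`, `<td>` cells with class `dddefault`); the real schedule file may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the assumed layout of the schedule table.
SAMPLE_HTML = """
<table class="datadisplaytable">
  <tr><th class="dddheader">Status</th><th class="dddheader">CRN</th></tr>
  <tr><td class="dddefault">Open</td><td class="dddefault">12345</td></tr>
  <tr><td class="dddefault">Closed</td><td class="dddefault">67890</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching element; find_all returns every match
table = soup.find("table", class_="datadisplaytable")
rows = table.find_all("tr")

for row in rows:
    # <tr> is a table row, <th> a header cell, <td> a data cell
    header_cells = row.select("th.dddheader")
    data_cells = row.select("td.dddefault")
    print(len(header_cells), "header cells,", len(data_cells), "data cells")
```

In this sample, only the first row contains header cells, so selecting `td.dddefault` on it yields an empty list.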

## 4. Scrape out the course CRNs 
Use a selector to print just the CRN column in the first 5 rows. Based on the HTML above, it appears that the CRN is in the second column, just after the open/closed status.

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    # create an HTML parse tree for the document; we can then select specific subtrees
    soup = BeautifulSoup(fp,'html.parser')
    
    # select all subtrees that represent a table row
    # the find and find_all methods are defined in the Beautiful Soup docs
    data_display_table_rows = soup.find('table',class_='datadisplaytable').find_all('tr')
    
    for row in data_display_table_rows[:5]:
        
        # select the data columns; result is [] if no data columns
        # the .dddefault css class is used in the HTML to indicate normal (non-header) table cells
        cols = row.select("td.dddefault")
        
        if (cols):
            # the string attribute represents the visible text content (i.e., not HTML markup)
            # crn is in col 1 in the cols list
            # if the crn is missing then we get '\xa0' instead of a blank string
            print(cols[1].string.strip('\xa0'))
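
One detail worth knowing before relying on `.string`: it returns `None` when a cell contains more than one child node, while `get_text()` always returns the combined text. The sketch below illustrates this on a made-up row; the cell contents (including the `<abbr>` tag) are assumptions for illustration, not taken from the schedule file.

```python
from bs4 import BeautifulSoup

# Made-up row: a plain cell, an empty (&nbsp;) cell, and a cell with nested markup.
SAMPLE_ROW = (
    "<table><tr>"
    "<td class='dddefault'>12345</td>"
    "<td class='dddefault'>&nbsp;</td>"
    "<td class='dddefault'>Smith (<abbr title='Primary'>P</abbr>)</td>"
    "</tr></table>"
)

cells = BeautifulSoup(SAMPLE_ROW, "html.parser").select("td.dddefault")

plain = cells[0].string            # the cell's single text node
blank = cells[1].string            # &nbsp; is decoded to '\xa0'
nested = cells[2].string           # None: the cell has several child nodes
nested_text = cells[2].get_text()  # all text in the cell, markup removed

print(repr(plain), repr(blank), repr(nested), repr(nested_text))
print(repr(blank.strip('\xa0')))   # empty string once the nonbreaking space is stripped
```

If a CRN cell ever contained nested markup, `cols[1].string` would be `None` and the call to `.strip` would raise an `AttributeError`; `get_text()` is the safer choice in that situation.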

## 5. Scrape out the other columns for any row that has a CRN
This is mostly formulaic: it works just like the CRN step, but for the other columns. Each scraped data record is stored in a dictionary and then printed.

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    # create an HTML parse tree for the document; we can then select specific subtrees
    soup = BeautifulSoup(fp,'html.parser')
    
    # select all subtrees that represent a table row
    # the find and find_all methods are defined in the Beautiful Soup docs
    data_display_table_rows = soup.find('table',class_='datadisplaytable').find_all('tr')
    
    for row in data_display_table_rows[:5]:
    
        # select the data columns; result is [] if no data columns
        # the .dddefault css class is used in the HTML to indicate normal (non-header) table cells
        cols = row.select("td.dddefault")
        
        # initialize a blank dictionary for the course data
        course_spec={}
        if (cols and cols[1].string.strip('\xa0')):
            course_spec['crn'] = cols[1].string
            course_spec['catalogid'] = cols[2].string + " " + cols[3].string
            course_spec['timecodes'] = [cols[8].string+" "+cols[9].string+" "+cols[17].string]
            course_spec['section'] = cols[4].string
            course_spec['credits'] = cols[6].string
            course_spec['title'] = cols[7].string
            course_spec['instructor'] = cols[16].get_text()[:-4]
            
            print(course_spec)

## 6. Deal with extra timecodes
This requires us to look at multiple rows, because the extra timecodes are listed on rows with a blank CRN. (See AC 204 for a few examples.) So, we'll accumulate a list of all `course_specs` instead of printing them one row at a time. That lets us add the extra timecodes to existing courses as we find them.

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    # create an HTML parse tree for the document; we can then select specific subtrees
    soup = BeautifulSoup(fp,'html.parser')
    
    # select all subtrees that represent a table row
    # the find and find_all methods are defined in the Beautiful Soup docs
    data_display_table_rows = soup.find('table',class_='datadisplaytable').find_all('tr')
    
    # The list of courses scraped from the file
    course_specs = []
    
    # Each row is for a single course, but there may be extra lines for multiple timecodes
    for row in data_display_table_rows[:30]:
    
        # select the data columns; result is [] if no data columns
        # the .dddefault css class is used in the HTML to indicate normal (non-header) table cells
        cols = row.select("td.dddefault")
        
        # cols is empty for table header rows
        if cols:
            
            # cols[1] is empty for "extra timecode" rows
            crn = cols[1].string.strip('\xa0')
            if crn:
                # the normal case
                
                # pick off the columns and stuff into a dict
                course= {
                    'crn' : crn,
                    'catalogid' : cols[2].string + " " + cols[3].string,
                    'timecodes' : [cols[8].string+" "+cols[9].string+" "+cols[17].string],
                    'section' : cols[4].string,
                    'credits' : cols[6].string,
                    'title' : cols[7].string,
                    'instructor' : cols[16].get_text()[:-4]
                }
                
                # add the dict to the course_specs list
                course_specs.append(course)
                
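            else:
                # the extra timecodes case: attach to the most recently added course
                # (skipped if no course has been seen yet, which avoids an IndexError)
                if course_specs:
                    course_specs[-1]['timecodes'].append(cols[8].string+" "+cols[9].string+" "+cols[17].string)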

course_specs
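
The accumulation pattern is easier to see without any HTML involved. The following sketch applies the same idea to a hypothetical list of `(crn, timecode)` pairs, where an empty CRN marks a continuation row; `merge_timecodes` and the sample data are illustrative names and values, not part of the course module.

```python
# Hypothetical (crn, timecode) pairs; an empty CRN marks an "extra timecode" row.
sample_rows = [
    ("10001", "MWF 9:00 am-9:50 am"),
    ("", "T 1:00 pm-3:00 pm"),
    ("10002", "TR 10:00 am-11:15 am"),
]


def merge_timecodes(rows):
    """Fold continuation rows into the course that precedes them."""
    courses = []
    for crn, timecode in rows:
        if crn:
            # normal case: start a new course record
            courses.append({"crn": crn, "timecodes": [timecode]})
        elif courses:
            # continuation row: extend the most recent course
            courses[-1]["timecodes"].append(timecode)
        # a continuation row with no preceding course is silently skipped
    return courses


merged = merge_timecodes(sample_rows)
print(merged)
```

Keeping the records in a list is what makes `courses[-1]` available when a continuation row shows up.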

## 7. Clean up potential data anomalies
There are two issues here:
- CRNs should be stored as integers, not strings.
- For cells without data, Beautiful Soup returns a nonbreaking space (the character `'\xa0'`). We stripped this out for the CRN above, but it can also appear in other columns, so we should strip it out there too.
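
One caveat: `str.strip('\xa0')` only removes nonbreaking spaces from the two ends of a string, so a value built by joining several empty cells can still contain them in the middle. A possible alternative is sketched below; `clean` is a hypothetical helper name and is not taken from the course module.

```python
def clean(text):
    """Hypothetical helper: replace nonbreaking spaces and collapse whitespace runs."""
    if text is None:
        # .string can be None for cells with nested markup
        return ""
    # turn nonbreaking spaces into ordinary ones, then split on any whitespace
    # and rejoin with single spaces (this also trims both ends)
    return " ".join(text.replace("\xa0", " ").split())


joined_empty_cells = "\xa0" + " " + "\xa0" + " " + "\xa0"

print(repr(joined_empty_cells.strip("\xa0")))  # interior nonbreaking space survives
print(repr(clean(joined_empty_cells)))         # fully cleaned
print(repr(clean("MWF\xa0 9:00 am-9:50 am")))
```

A helper like this could be applied to each cell's text before the values are concatenated or stored.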

In [None]:
with open("Spring2018ClassSchedules.html") as fp:
    # create an HTML parse tree for the document; we can then select specific subtrees
    soup = BeautifulSoup(fp,'html.parser')
    
    # select all subtrees that represent a table row
    # the find and find_all methods are defined in the Beautiful Soup docs
    data_display_table_rows = soup.find('table',class_='datadisplaytable').find_all('tr')
    
    # The list of courses scraped from the file
    course_specs = []
    
    # Each row is for a single course, but there may be extra lines for multiple timecodes
    for row in data_display_table_rows[:20]:
    
        # select the data columns; result is [] if no data columns
        # the .dddefault css class is used in the HTML to indicate normal (non-header) table cells
        cols = row.select("td.dddefault")
        
        # cols is empty for table header rows
        if cols:
            
            # cols[1] is empty for "extra timecode" rows
            crn = cols[1].string.strip('\xa0')
            if crn:
                # the normal case
                
                # pick off the columns and stuff into a dict
                course= {
                    'crn' : int(crn),
                    'catalogid' : (cols[2].string + " " + cols[3].string).strip('\xa0'),
                    'timecodes' : [(cols[8].string+" "+cols[9].string+" "+cols[17].string).strip('\xa0')],
                    'section' : cols[4].string.strip('\xa0'),
                    'credits' : cols[6].string.strip('\xa0'),
                    'title' : cols[7].string.strip('\xa0'),
                    'instructor' : (cols[16].get_text()[:-4]).strip('\xa0')
                }
                
                # add the dict to the course_specs list
                course_specs.append(course)
                
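            else:
                # the extra timecodes case: attach to the most recently added course
                # (skipped if no course has been seen yet, which avoids an IndexError)
                if course_specs:
                    course_specs[-1]['timecodes'].append((cols[8].string+" "+cols[9].string+" "+cols[17].string).strip('\xa0'))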

course_specs

## 8. The finished code
The `course_schedules_beautiful_soup.py` module (in this folder) has a slightly more refined version of this code that ...
- wraps the scraping code above into a function called  `scrape_banner_course_schedule()` 
- provides a `banner_col` dict that maps column names to column numbers (just in case Banner changes column order)
- defines a `json_dump()` utility function that we can use to be 100% sure that the data is JSON-formatted if needed
- cleans up extraneous spaces in the timecodes
- includes (commented out) test code so that the file can be run from the command line or a debugger 
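
The module itself is not reproduced here, so the following is only a rough sketch of what a column map and a JSON dump helper could look like. The names `BANNER_COL_SKETCH` and `json_dump_sketch` are hypothetical, the column numbers are the ones used in the cells above, and the real `banner_col` and `json_dump()` may be written differently.

```python
import json
import os
import tempfile

# Hypothetical column map, using the column numbers from the earlier cells.
BANNER_COL_SKETCH = {"crn": 1, "section": 4, "credits": 6, "title": 7, "instructor": 16}


def json_dump_sketch(data, path):
    """Assumed behavior: write data to path as indented JSON."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, indent=2)


# Stand-in for the text of one row's cells (made-up values).
cells = [f"col{i}" for i in range(18)]

# Look columns up by name instead of hard-coding the numbers at each use.
record = {name: cells[index] for name, index in BANNER_COL_SKETCH.items()}

out_path = os.path.join(tempfile.mkdtemp(), "sample.json")
json_dump_sketch([record], out_path)

# Reading the file back confirms that it holds valid JSON.
with open(out_path, encoding="utf-8") as fp:
    round_trip = json.load(fp)
print(round_trip)
```

With a map like this, a change in column order would only require editing the dictionary, not every indexing expression.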

In [None]:
import course_schedules_beautiful_soup as csbs
course_offerings = csbs.scrape_banner_course_schedule('Spring2018ClassSchedules.html')
csbs.json_dump(course_offerings,"Spring2018ClassSchedules.json")