Chloe Fugle (chloe.m.fugle.23@dartmouth.edu), 6/7/22, CS72 Final Project 

A searchable database of Dartmouth College courses and their descriptions, built to help students find classes related to their interests more easily. The search combines TF-IDF keyword weighting with WordNet word similarity to return the most relevant courses.  

### Instructions:  
To run the program, select Runtime -> Run all. Please note that due to the structure of the Dartmouth Timetable website, the program will take about 15 minutes to compile a searchable database.   
To avoid this wait time, a file "course_desc.txt" has been included with the submission. This file contains the scraped and processed text. Upload the file to the Google Colab runtime using the file icon to the left and run every cell except for the cell titled "Full web-scraper."  
Once the pre-processing functions have been run, the search function (the last cell) can be run as many times as desired to search for courses. Courses are listed in order of relevance to the user's search.



In [17]:
# import necessary packages

import requests
from bs4 import BeautifulSoup
import re
import fileinput
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from IPython.display import display
from sklearn.metrics import pairwise_distances
import numpy as np
from nltk.corpus import wordnet
nltk.download('wordnet')

cell_executed = False

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
# Fetches the HTML text of the information contained within the table of courses
# on the Dartmouth Timetable, found at https://oracle-www.dartmouth.edu/dart/groucho/timetable.display_courses
#
# input: none
# output: BeautifulSoup HTML object containing HTML text of all courses offered,
#     note: this includes the URL to the course description, but not the 
#           description itself

def get_data():
    trm = '202209' #22F
    url = "https://oracle-www.dartmouth.edu/dart/groucho/timetable.display_courses"
    payload = {
        'classyear' : '2008',
        'searchtype' : 'Subject Area(s)',
        'pmode' : 'public',
        'term' : '',
        'levl' : '',
        'fys' : 'n',
        'wrt' : 'n',
        'pe' : 'n',
        'review' : 'n',
        'crnl' : 'no_value',
        'termradio' : 'allterms',
        'terms' : '202209',
        'hoursradio' : 'allhours',
        'periods' : 'no_value',
        'subjectradio' : 'allsubjects',
        'depts' : 'no_value',
        'deliveryradio' : 'alldelivery',
        'deliverymodes' : 'no_value',
        'distribs_i' : 'no_value',
        'distribs_wc' : 'no_value',
        'distribradio' : 'alldistribs',
        'distribs' : 'no_value',
        'sortorder' : 'dept'
    }

    rdata = requests.post(url, data = payload) # sends search criteria to Timetable, receives HTML
    soup = BeautifulSoup(rdata.text, features="lxml") # parses the returned HTML with the lxml parser into a soup object for searching

    datatable = soup.find('div', class_= 'data-table')
    
    return datatable

In [11]:
# Extracts course names and descriptions from the HTML text, creates a file
# containing one course name, a tab, and the course description on each line.
# Follows the link provided for each course to scrape the course description.
#   note: due to the structure of the Dartmouth Timetable, this function 
#         takes about 14 min to run
#
# input: BeautifulSoup HTML object containing the text of all courses offered
# output: files ("course_desc.txt", "course_nums.txt") and dictionary containing the course names and course descriptions

def make_list_file(datatable):

    timetable = open("timetable.txt", "w+")
    timetable.write(datatable.text)
    timetable.close()
    
    # group information about each course together in a dictionary
    i = 0   # file line number
    j = -1  # current class line number
    match = re.compile(r'202209').search
    class_dict = {}
    name_num_dict = {}
    temp_list = []
    class_name = ""
    timetable = open("timetable.txt", "r")

    for line in timetable:
      line = re.sub(r'[^a-zA-Z0-9_ ]', '', line) # remove all non-alphanumeric characters
      if i<25:  # ignore first 25 lines of file
        i += 1
        continue
      else:
        if j == -1 and match(line): # course is held in specified term
          j += 1
          continue
        if j >= 0 and j < 30:  # line contains information about relevant course
          if j != 14:   # course name found on line 14 of class information
            if len(line.strip()) != 0:
              temp_list.append(line)
            if j == 1:    # department 4-letter code
              dept_code = line
            if j == 2:    # course number
              course_num = line
            j += 1
          else:
            class_name = line
            j += 1
          if j == 29:   # each class contains 29 lines of information
            class_dict[class_name] = temp_list
            name_num_dict[class_name] = str(dept_code) + " " + str(course_num)
            j = -1    # end of class information, reset count
      i += 1

    timetable.close()

    # for every class in class dictionary, get the link to the course information
    classdesc_dict = {}
    link_list = datatable.find_all('a')   # get all <a> tags within the datatable

    for classname in class_dict:  # for every class, get link
      regex = re.compile(r'>'+re.escape(classname)+r'<')
      for i in link_list:
        if re.search(regex, str(i)):
          new = re.sub(r'&amp;', r'&', str(i))  # fix link
          newlink = BeautifulSoup(new, features="lxml")   # parse the tag text to search for the link
          justlink = newlink.find('a')['href'].split("'")[1]  # get just the link
          
          # get course description text from link
          rdata = requests.post(justlink)
          soup = BeautifulSoup(rdata.text, features="lxml")
          text = str(soup.find_all('p'))

          # make the stripped text pretty
          if len(text.split("<p>")) > 3:
            temp = text.split("<p>")[3]
            if len(temp.split("</p>")) > 0:
              temp = temp.split("</p>")[0]
            else:
              temp = "No course description available"
          else:
            temp = "No course description available"

          # add to the course description dictionary
          classdesc_dict[classname] = temp

    # create file with course names and descriptions
    # (the with-blocks close the files so the contents are flushed to disk)
    with open("course_desc.txt", "w+") as course_desc:
      for course in classdesc_dict:
        course_desc.write(course + "\t" + classdesc_dict[course] + "\n")

    # create file containing class names and numbers
    with open("course_nums.txt", "w+") as course_nums:
      for course in name_num_dict:
        course_nums.write(course + "\t" + name_num_dict[course] + "\n")
  
    return classdesc_dict
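
The link-extraction step above relies on each course's `<a>` tag wrapping its URL in single quotes, which is why `split("'")[1]` recovers it. The sketch below illustrates that idea on a made-up tag. The `show_desc` wrapper name and the `example.edu` URL are hypothetical and not taken from the Timetable page, and the sketch works on the raw tag text rather than on the parsed `href` attribute.

```python
import html

# Hypothetical anchor tag; the JavaScript wrapper name and URL are made up
# for illustration and are not taken from the real Timetable page.
raw_tag = ('<a href="javascript:show_desc(\'https://example.edu/desc?subj=MATH&amp;num=3\')">'
           'Calculus</a>')

def extract_link(tag_text):
    """Return the text between the first pair of single quotes, or None."""
    # html.unescape turns "&amp;" back into "&", like the re.sub fix above.
    parts = html.unescape(tag_text).split("'")
    # A tag with a quoted URL splits into at least three parts: before, URL, after.
    return parts[1] if len(parts) >= 3 else None

print(extract_link(raw_tag))
```

Returning `None` when no quoted URL is present avoids the `IndexError` that a bare `split("'")[1]` would raise on a tag with a different shape.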

    


          
          




In [12]:
# Creates course description dictionary from file generated by make_list_file().
# It is not necessary to run this function if you've run make_list_file(). This
# function saves time if you already have the course description file.
#
# input: "course_desc.txt" generated by make_list_file(), contains one course 
#         name, a tab, and the course description on each line
# output: dictionary containing each course name as a key and its description
#         as the value

def get_dict_from_file(filename):
  classdesc_dict = {}

  # the with-block closes the file once every line has been read
  with open(filename, "r") as infile:
    for line in infile:
      parts = line.split("\t")
      name = parts[0]
      if len(parts) > 1:
        desc = parts[1]
      else:
        desc = "No course description available"
      classdesc_dict[name] = desc

  return classdesc_dict
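
To make the file format concrete, here is a minimal round-trip sketch: it writes a toy dictionary in the tab-separated layout and reads it back. The course entries and the helper names `write_courses` and `read_courses` are illustrative assumptions. Unlike the function above, this sketch strips the trailing newline from each description.

```python
import os
import tempfile

# Toy data for illustration; real entries come from the scraper.
courses = {"Calculus": "Single variable calculus.", "Mystery Course": ""}

def write_courses(path, course_dict):
    """Write one 'name<TAB>description' line per course."""
    with open(path, "w") as f:
        for name, desc in course_dict.items():
            f.write(name + "\t" + desc + "\n")

def read_courses(path):
    """Read the file back, substituting a placeholder for empty descriptions."""
    result = {}
    with open(path) as f:
        for line in f:
            name, _, desc = line.rstrip("\n").partition("\t")
            result[name] = desc or "No course description available"
    return result

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "course_desc.txt")
    write_courses(path, courses)
    loaded = read_courses(path)

print(loaded)
```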

In [13]:
# Vectorizes every course description using TF-IDF. Generates a dictionary
# of the course names and their sparse vectors of keyword weights.
# TF-IDF code modified from code found at https://towardsdatascience.com/using-tf-idf-to-form-descriptive-chapter-summaries-via-keyword-extraction-4e6fd857d190
#
# inputs: 
#   classdesc_dict: dictionary of course names and their descriptions
#   keywords: defaults to False, set to True for a printed list of courses and the
#       keywords included in their descriptions and a list of all the keywords in
#       alphabetical order
# outputs:
#   keynums_dict: dictionary of the course names and their sparse vectors
#   colnames: list of the keywords associated with each number in the sparse vector


def create_tfidf(classdesc_dict, keywords=False):

  # load NLTK's English stopwords as a set
  st = set(stopwords.words('english'))

  # turn dictionary into list for use in tfidf
  classlist = []
  namelist = []   # list to keep the classes in to match with the keywords
  for course in classdesc_dict:
    classlist.append(course + " " + classdesc_dict[course])
    namelist.append(course)

  # extract keywords from each course description using tfidf vectorizer
  vectorizer = TfidfVectorizer()
  vectors = vectorizer.fit_transform(classlist)
  names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
  data = vectors.todense().tolist()
  df = pd.DataFrame(data, columns=names)
  # keep only alphabetic, non-stopword columns
  df = df[[col for col in df.columns if col not in st and col.isalpha()]]


  colnames = df.columns
  keynums_dict = {}
  keywords_dict = {}
  j = 0
  for index, row in df.iterrows():
    temp_list = []
    temp_words = []
    for i in range(0, len(row)):
      item = row[i]
      temp_list.append(item)
      if keywords and item != 0:
        temp_words.append(colnames[i])
    keynums_dict[namelist[j]] = temp_list
    keywords_dict[namelist[j]] = temp_words
    j += 1

  # print dictionary of courses and their keywords in a pretty way
  if keywords:
    print("")
    for course in keywords_dict:
      j = 0
      print(course + ":")
      for word in keywords_dict[course]:
          print(word, end = " ")
          j += 1
          if j > 20:
            print("")
            j = 0
      print("\n")

    # print list of keywords in a pretty way
    print("")
    j = 0
    for word in colnames:   # colnames is the local keyword list (matrix_names is not defined inside this function)
      print(word, end=" ")
      j += 1
      if j > 20:
        print("")
        j = 0

  return keynums_dict, colnames
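
For readers who want to see what the TF-IDF weights look like, the following is a simplified, pure-Python stand-in for the vectorizing step. It assumes whitespace tokenization, a one-word stopword set, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults. The toy course texts are invented. Unlike the function above, it removes stopwords before weighting instead of dropping columns afterwards.

```python
import math
from collections import Counter

# Toy course texts for illustration only.
docs = {
    "Calculus": "limits derivatives and integration",
    "Linear Algebra": "vectors matrices and linear maps",
    "Poetry": "reading and writing poems",
}
stop = {"and"}  # tiny stand-in for NLTK's English stopword list

def tfidf(doc_dict, stopwords):
    """Return ({name: weight vector}, vocabulary) for a small corpus."""
    tokens = {
        name: [w for w in text.lower().split() if w.isalpha() and w not in stopwords]
        for name, text in doc_dict.items()
    }
    vocab = sorted({w for words in tokens.values() for w in words})
    n = len(tokens)
    # document frequency: number of texts containing each word
    df = Counter(w for words in tokens.values() for w in set(words))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        vectors[name] = [x / norm for x in raw]
    return vectors, vocab

vectors, vocab = tfidf(docs, stop)
print(vocab)
print(vectors["Calculus"])
```

Each vector has one entry per vocabulary word, so most entries are zero for any single course; that sparsity is what the keyword dictionary above captures.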

    
    
    


In [14]:
# Called from within search(). Creates a sparse matrix of the user's search.
# If a word in the user's search is in the list of keywords, that keyword is added to the sparse matrix.
# Also adds the WordNet (Wu-Palmer) similarity score for keywords very similar to the
# user's search terms to return more relevant results.
#
# inputs:
#    user_search: string containing the lowercased user's search
#    keywords_list: list of the keywords corresponding to each number in the sparse vector
#         generated by create_tfidf()
#    basic: defaults to False, set to True for a search without WordNet, for
#         testing purposes
# output: returns a sparse matrix of the user's search, matching keywords in the
#         search and including the similarity scores for highly similar keywords

def get_user_matrix(user_search, keywords_list, basic=False):
  new_search = []

  # create sparse matrix of the user's search
  user_matrix = []

  # create sparse matrix of user search based on keywords from course descriptions
  # uses wordnet to also match the user's search to synonyms
  if not basic:
    for word in keywords_list:
      running_score = 0
      if word in user_search.split(" "):
        running_score += 1
      else:
        for term in user_search.split(" "):
          score = None
          if len(wordnet.synsets(term)) > 0:
            wn_term = wordnet.synsets(term)[0]
            if len(wordnet.synsets(word)) > 0:
              wn_word = wordnet.synsets(word)[0]
              score = wn_term.wup_similarity(wn_word)
          if score is None:
            score = 0
          if score > 0.9 and score < 1.0:   # cutoff for word similarity
            running_score += score    
          if score == 1.0:
            running_score += 1.0
      if running_score >= 0.9:
        user_matrix.append(running_score)
      else:
        user_matrix.append(0)
  
  # search without WordNet for testing purposes
  if basic:
    for i in range(len(keywords_list)):
      present = False
      for word in user_search.split(" "):
        if word == keywords_list[i]:
          present = True
      if present:
        user_matrix.append(1)
      else:
        user_matrix.append(0)

  # normalize by the largest entry so every value falls between 0 and 1
  numsum = max(user_matrix)
  if numsum != 0:
    for i in range(len(user_matrix)):
      user_matrix[i] = user_matrix[i]/numsum

  return user_matrix

# test case
# get_user_matrix("cats lion fruit american", ["dogs", "lion", "cat", "strawberry", "sheep", "cats", "russian", "russia", "america"])
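
The scoring rule in `get_user_matrix()` can be tried without downloading WordNet by swapping in a lookup table. In the sketch below, `TOY_SIMILARITY` and its scores are invented stand-ins for `wup_similarity`, and `query_vector` is a hypothetical condensed version of the non-basic branch: exact matches score 1, near-synonyms above the cutoff add their similarity, and the result is divided by its largest entry.

```python
# Hypothetical similarity scores standing in for WordNet's wup_similarity;
# the numbers are invented for illustration.
TOY_SIMILARITY = {
    frozenset(["cat", "cats"]): 1.0,
    frozenset(["lion", "cat"]): 0.93,
    frozenset(["lion", "sheep"]): 0.6,
}

def toy_similarity(a, b):
    """Look up a symmetric similarity score, defaulting to 0."""
    return TOY_SIMILARITY.get(frozenset([a, b]), 0.0)

def query_vector(search, keywords, cutoff=0.9):
    """Build a normalized query vector aligned with the keyword list."""
    terms = search.lower().split()
    vector = []
    for keyword in keywords:
        if keyword in terms:
            score = 1.0  # exact match
        else:
            # add up only the similarities above the cutoff
            sims = (toy_similarity(term, keyword) for term in terms)
            score = sum(s for s in sims if s > cutoff)
        vector.append(score if score >= cutoff else 0.0)
    peak = max(vector, default=0.0)
    return [v / peak for v in vector] if peak else vector

keywords = ["dogs", "lion", "cat", "sheep", "cats"]
print(query_vector("cats lion", keywords))
```

Note that a keyword similar to several search terms (here `cat`) can outscore an exact match before normalization, because the similarities are summed.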

In [15]:
# Function that handles the user's search. Prints prompt and creates sparse matrix
# of user search. Compares the user's matrix to every course's sparse matrix using
# cosine similarity. Prints top ten most relevant courses and their course
# descriptions.
#
# inputs:
#    keynums_dict: dictionary of the course names and their sparse vectors
#    matrix_names: list of the keywords corresponding to each number in the sparse vector
#         generated by create_tfidf()
#    classdesc_dict: dictionary of course names and their descriptions
#    basic: defaults to False, set to True for a search without WordNet, for
#         testing purposes
# outputs: none except text of course description results printed to console

def search(keynums_dict, matrix_names, classdesc_dict, basic=False):
  user_search = input("Please enter your search: ").lower()

  if basic:
    print("Using basic search function.")
  user_matrix = get_user_matrix(user_search, matrix_names, basic)

  # get cosine similarity of user search matrix to all matrices of course descriptions
  class_list = []
  matrix_list = []
  for key in keynums_dict: # create matching list of classes and sparse matrices
    class_list.append(key)
    matrix_list.append(keynums_dict[key])

  # get matrices with highest cosine similarity to search matrix
  matrix_list.insert(0, user_matrix)
  matrix_matrix = np.array(matrix_list)  # convert to matrix for comparison
  dist_out = 1-pairwise_distances(matrix_matrix, metric="cosine")
  similarity = list(dist_out[0])
  sim_index = [i for i,v in enumerate(similarity) if v > 0]   # index of non-zero values
  search_list = []
  for i in range(len(similarity)):  # create list of non-zero values and the courses they belong to
    if i in sim_index:
      if i == 0:  # matching own matrix
        continue
      search_list.append((class_list[i-1], similarity[i]))
  
  # return course names and descriptions from search
  search_list.sort(key=lambda a:a[1], reverse=True)
  i = 0
  print("Search results:")
  for course in search_list:
    if i > 9: # if there are more than 10 non-zero courses, return 10 with highest similarity
      break
    print(str(i+1) + ". " + course[0])
    j = 0
    for word in classdesc_dict[course[0]].split(" "):
      print(word, end = " ")
      if j > 10:
        print("")
        j = 0
      j += 1
    print("")
    i += 1

  if not search_list:
    print("Sorry, no results found.")
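
The ranking step in `search()` comes down to cosine similarity between the query vector and each course vector, keeping the non-zero scores and sorting them. A minimal sketch using only the standard library follows; the `cosine` and `rank` helpers and the toy vectors are illustrative assumptions, and the sketch compares the query against each course directly instead of building the full pairwise matrix.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, course_vectors, top_k=10):
    """Return up to top_k (course, score) pairs with score > 0, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in course_vectors.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]

# Toy vectors for illustration; real ones come from create_tfidf().
toy_vectors = {
    "Calculus": [0.9, 0.1, 0.0],
    "Statistics": [0.4, 0.6, 0.0],
    "Poetry": [0.0, 0.0, 1.0],
}
results = rank([1.0, 0.0, 0.0], toy_vectors)
for name, score in results:
    print(name, round(score, 3))
```

Comparing only the query row keeps the work proportional to the number of courses, whereas a full pairwise matrix also compares every course with every other course.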


  

In [24]:
# Full web-scraper
# 
# Scrapes courses and course descriptions from Timetable website, generates
# dictionary and "course_desc.txt" file necessary for TF-IDF and search.
# Note: this will take about 17 minutes because of the structure of the Timetable
#
# DO NOT RUN BOTH THIS CELL AND THE CELL DIRECTLY BELOW IT - they have duplicate
# functionality

datatable = get_data()
classdesc_dict = make_list_file(datatable)
keynums_dict, matrix_names = create_tfidf(classdesc_dict)

cell_executed = True  # prevents cell below from being run if user hits "run all"



In [18]:
# Time-saver
# 
# Converts "course_desc.txt" file to dictionary necessary for TF-IDF and search.
# Saves time because the Timetable website does not have to be scraped.
# Note: must upload "course_desc.txt" to Files tab on left in Colab
#
# DO NOT RUN BOTH THIS CELL AND THE CELL DIRECTLY ABOVE IT - they have duplicate
# functionality

if not cell_executed:
  classdesc_dict = get_dict_from_file("course_desc.txt")
  keynums_dict, matrix_names = create_tfidf(classdesc_dict)



In [23]:
# User interface search function. Returns courses from the Dartmouth Timetable
# relevant to the user's search.
# Note: the first time you run the search, it can take up to a minute. After that, 
#       the search is fairly quick
#
# Instructions: Run cell by pressing start button on the left. Type search terms 
# into the box that appears. If there are relevant courses, they will appear with
# the most relevant courses at the top. Run the cell again to make another search.

search(keynums_dict, matrix_names, classdesc_dict)

Please enter your search: calculus
Search results:
1. Calculus
This course is an introduction to single variable calculus aimed at students 
who have seen some calculus before, either before matriculation or in 
MATH 1.  MATH 3 begins by revisiting the core topics in 
MATH 1 - convergence, limits, and derivatives - in greater depth 
before moving to applications of differentiation such as related rates, finding 
extreme values, and optimization.  The course then turns to integration theory, 
introducing the integral via Riemann sums, the fundamental theorem of calculus, 
and basic techniques of integration.  
 
2. Accelerated Multivar Calc
This briskly paced course can be viewed as equivalent to MATH 13 
in terms of prerequisites, but is designed especially for first-year students 
who have successfully completed a BC calculus curriculum in secondary school. 
In particular, as part of its syllabus it includes most of 
the multivariable calculus material present in MATH 8 together with 