## Georgetown University

This script serves as a basic tutorial for extracting courses of interest from a university. This is by no means the only (or even best way) to go about this process—so if you come up with a process that works better, feel free to implement! If you're unfamiliar with any of the libraries, the comments below annotate reasoning behind each.

In [36]:
import sys
import pandas as pd
import numpy as np
import time
import re
import urllib.request #handles urls
from urllib.request import urlopen
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

The first thing we want to do is set up a function for standard preprocessing. It's also useful to list all of the URLs we'll need to send requests to before scraping. We want all courses within a 2 year *academic* calendar (as opposed to an annual calendar). 

In [37]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    keyword = word_tokenize(keyword) #tokenize
    for word in keyword:
        keyword = stemmer.stem(word) #stem 
    return (keyword)

Next, we'll want to import our keyword csv, split our keyword lists, and preprocess them. The way the csv is set up, we'll want to split the words that are indicated as technical (`T`) or normative (`N`) and that we've chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others. Here, we've chosen to manually alter words that are not usefully preprocessed. In this case, it means replacing instances of words that are stemmed to end in i.

[regex is a bitch here]

In [38]:
#import keywords
keywords = pd.read_csv("../keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want search within these courses to see if it also contains a technical keyword.

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`

In [39]:
#init dfs
georgetown_list = []

In [44]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome()

georgetown_list = []

#course catalog URLs - 2 academic years 
terms = ['Fall 2017', 'Spring 2018', 'Summer 2018', 'Fall 2018', 'Spring 2019', 'Summer 2019']


for term in terms:
    print('--------------------')
    print('--------------------')
    print(term)
    term = term + ' (View only)'
    
    for word in normative:
        driver.get("https://myaccess.georgetown.edu/pls/bninbp/bwckschd.p_disp_dyn_sched#_ga=")
        time.sleep(2)
        
        select = Select(driver.find_element_by_xpath('//*[@id="contentHolder"]/div[2]/form/table/tbody/tr/td/select'))
        submit = driver.find_element_by_xpath('//*[@id="id____UID0"]')
        select.select_by_visible_text(term)
        submit.click()
        time.sleep(2)

        # for word in normative:
        subject_field = driver.find_element_by_xpath('//*[@id="subj_id"]') 
        subject_select = Select(subject_field)

        #select all subjects
        for subject in subject_field.find_elements_by_tag_name('option'):
            subject_select.select_by_visible_text(subject.text)

        text_input = driver.find_element_by_xpath('//*[@id="title_id"]')
        
        print('--------------------')
        print(word)
        
        text_input.send_keys(word)
        get_course = driver.find_element_by_xpath('//*[@id="id____UID0"]')
        get_course.click()
        time.sleep(2)

        all_courses = driver.find_element_by_xpath('//*[@id="contentHolder"]/div[2]/table[1]/tbody')
        courses = all_courses.find_elements_by_tag_name('tr')

        full_titles = driver.find_elements_by_class_name('ddtitle')
        descriptions = driver.find_elements_by_class_name('dddefault')
        

        counter = 0

        for full_title, description in zip(full_titles, descriptions):
            print(counter)
            counter += 1

#             print("full_title", full_title.text)
            print("description", description.text)


            georgetown_dict = {}
            title_split = full_title.text.split('-')

#             for title_el in title_split:
#                 print(title_el)
                    
            dept_num = title_split[0]
            georgetown_dict['dept_num'] = title_split[2]
            georgetown_dict['title'] = title_split[0]
                
            credit_regex = r'[0-9]\.[0-9]{3} Credits'
            credits = re.findall(credit_regex, description.text)

            georgetown_dict['credits'] = credits[0].replace('Credits', '')
            georgetown_dict['description'] = description.text
            georgetown_dict['term'] = term.replace(' (View only)', '')
            georgetown_dict['keyword'] = word
            georgetown_list.append(georgetown_dict)

driver.close()

--------------------
--------------------
Fall 2017
--------------------
account
0
description See individual course for departmental web site, faculty profiles, and course descriptions.

Associated Term: Fall 2017
Registration Dates: Mar 30, 2017 to Sep 09, 2017
Levels: MN or MC Graduate, Undergraduate

Main Campus  
Lecture Schedule Type
0.000 Credits
View Course Description
View Syllabus
View Textbook
1
description Associated Term: Fall 2017
Registration Dates: Mar 30, 2017 to Sep 09, 2017
Levels: Undergraduate
Attributes: SFS/IECO Finance/Commerce (B), SFS/STIA Growth/Development

Main Campus  
Lecture Schedule Type
3.000 Credits
View Course Description
View Syllabus
View Textbook

Scheduled Meeting Times
Type Time Days Where Date Range Schedule Type Instructors
Lecture 2:00 pm - 3:15 pm MW Healy 103 Aug 30, 2017 - Dec 20, 2017 Lecture Edward Machir
2
description Lecture


IndexError: list index out of range

In [43]:
print(description.text)

Lecture


In [41]:
georgetown = pd.DataFrame(georgetown_list)
georgetown

Unnamed: 0,credits,dept_num,description,keyword,term,title
0,0.0,ACCT 000,See individual course for departmental web sit...,account,Fall 2017,Accounting
1,3.0,ACCT 001,Associated Term: Fall 2017\nRegistration Dates...,account,Fall 2017,Principles of Accounting


In [None]:
georgetown = pd.DataFrame(georgetown_list)
georgetown

The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

Now that we've extracted all courses containing a normative keyword of interest, we need to filter our courses to only return titles that contain a normative AND a technical keyword. This is the case for all words except instances of our preprocessed `privac` and `secur`, for which we want to return all courses, even if they don't contain two keywords. To do this, we'll split the courses into two data frames, apply our respective conditions, and then merge them back together. 

In [None]:
exceptions = georgetown.loc[(georgetown['keyword']=='privac') | (georgetown['keyword'] =='secur')]
exceptions

In [None]:
#loop through technical keyword list, extract relevant titles
for word in technical:
    df = georgetown[georgetown['title'].str.contains(word, flags = re.IGNORECASE)]
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df

NOTE: the above cell is likely not the best nor most simple way to execute this step! Feel free to take special liberties here. It's probably wise to pick out a few titles that you know should be returned manually, then check to see if the script is working as desired. 

In [None]:
#combine dfs 
georgetown = pd.concat([df, exceptions])
georgetown

In [None]:
georgetown = georgetown[['title', 'dept_num', 'description', 'credits', 'term', 'keyword']]
georgetown

Lastly, we want to export our csv. Ideally, all csv files should be written to the courses directory in our repository. 

In [None]:
#export as csv
georgetown.to_csv('../courses/6-Georgetown-University-Schedule.csv', index=False)