## Columbia University

This script serves as a basic tutorial for extracting courses of interest from a university. This is by no means the only (or even best way) to go about this process—so if you come up with a process that works better, feel free to implement! If you're unfamiliar with any of the libraries, the comments below annotate reasoning behind each.

In [36]:
import pandas as pd
import numpy as np
import time
import re
import urllib.request #handles urls
from urllib.request import urlopen
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

The first thing we want to do is set up a function for standard preprocessing. It's also useful to list all of the URLs we'll need to send requests to before scraping. We want all courses within a 2 year *academic* calendar (as opposed to an annual calendar). 

In [37]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    keyword = word_tokenize(keyword) #tokenize
    for word in keyword:
        keyword = stemmer.stem(word) #stem 
    return (keyword)

#course catalog URLs - 2 academic years 
#only 2019 is available, Fall(3), Summer(2), Spring(1)
# urls array

urls = [{"term": 'Fall 2019', "url":"?site=Directory_of_Classes&instr=&days=&semes=20191&hour="},
        {"term": 'Summer 2019', "url":'?si?site=Directory_of_Classes&instr=&days=&semes=20192&hour='},
        {"term": 'Spring 2019', "url":'?site=Directory_of_Classes&instr=&days=&semes=20193&hour='}]

link = 'https://doc.search.columbia.edu/classes/'

Next, we'll want to import our keyword csv, split our keyword lists, and preprocess them. The way the csv is set up, we'll want to split the words that are indicated as technical (`T`) or normative (`N`) and that we've chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others. Here, we've chosen to manually alter words that are not usefully preprocessed. In this case, it means replacing instances of words that are stemmed to end in i.

[regex is a bitch here]

In [38]:
#import keywords
keywords = pd.read_csv("../keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want search within these courses to see if it also contains a technical keyword.

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`

In [39]:
#init dfs
columbia = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword'])

The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

In [83]:
#roster search for all urls
from selenium import webdriver

for url in urls:
#     print("url", url["term"])
    #loop through all normative words and extract relevant elements 
    for word in normative: 
        url_keyword = link + word + url['url'] #NOTE:this structure will likely be different between rosters!
        driver = webdriver.Chrome()
        driver.get(url_keyword)
        time.sleep(3)

        #the number of reponses
        elements = driver.find_elements_by_xpath('//*[@id="gsa-search-results"]/li')
        results = len(elements)
        print("elements", len(elements))
        
        #scraping each results
        for x in range(0, results):
            print('-------------------')
            print('x', x)
            driver.get(url_keyword)
            time.sleep(4)
            
            title = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/h3/a').text
            section = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/h3/a/span').text
            title.replace(section, '')
            
            course_link = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/div[2]')
            print('course_link', course_link.text)
            course_link.click()
            time.sleep(3)

            dept_nums = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[10]/td[2]').text
            print('dept_nums', dept_nums)
            
            descs = ''
            credit = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[4]/td[2]').text
            profs = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[7]/td[2]').text
            
            syllabi = ''  
            uni = 'columbia university'
            term = url["term"]  
            keyword = word
            
            columbia.loc[x] = [title, dept_nums, descs, credit, profs, syllabi, uni, term, keyword]
            print('url_keyword', url_keyword)
            print('columbia', columbia)
        
        driver.close()


elements 54
-------------------
x 0
course_link http://www.columbia.edu/cu/bulletin/uwb/#/cu/bulletin/uwb/subj/ACCT/B5001-20191-200
dept_nums Accounting (ACCT)
url_keyword https://doc.search.columbia.edu/classes/account?site=Directory_of_Classes&instr=&days=&semes=20191&hour=
columbia                                                title           dept_num  \
0  Accounting I: Financial Accoun\nSpring 2019 Ac...  Accounting (ACCT)   
1  Accounting I: Financial Accoun\nSpring 2019 Ac...  Accounting (ACCT)   

  description credits                                        instructor  \
0                   3  Amir Ziv\nFelicia C Goodman\nJessica Soursourian   
1                   0                             Faculty\nPhil Mendoza   

  syllabus             university       term    keyword  
0             columbia university  Fall 2019    account  
1           [columbia university]  Fall 2019  [account]  
-------------------
x 1
course_link http://www.columbia.edu/cu/bulletin/uwb/#/cu/bulleti

WebDriverException: Message: unknown error: Element is not clickable at point (425, 1017)
  (Session info: chrome=75.0.3770.100)
  (Driver info: chromedriver=2.30.477690 (c53f4ad87510ee97b5c3425a14c0e79780cdf262),platform=Mac OS X 10.14.5 x86_64)


In [84]:
columbia

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
0,Accounting I: Financial Accoun\nSpring 2019 Ac...,Accounting (ACCT),,3,Amir Ziv\nFelicia C Goodman\nJessica Soursourian,,columbia university,Fall 2019,account
1,Accounting I: Financial Accoun\nSpring 2019 Ac...,Accounting (ACCT),,0,Faculty\nPhil Mendoza,,[columbia university],Fall 2019,[account]


Now that we've extracted all courses containing a normative keyword of interest, we need to filter our courses to only return titles that contain a normative AND a technical keyword. This is the case for all words except instances of our preprocessed `privac` and `secur`, for which we want to return all courses, even if they don't contain two keywords. To do this, we'll split the courses into two data frames, apply our respective conditions, and then merge them back together. 

In [15]:
exceptions = columbia.loc[columbia['keyword']=='privac') | (columbia['keyword'] =='secur')]
exceptions

NameError: name 'cornell' is not defined

In [None]:
#loop through technical keyword list, extract relevant titles
for word in technical:
    df = cornell[cornell['title'].str.contains(word, flags = re.IGNORECASE)]
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df

NOTE: the above cell is likely not the best nor most simple way to execute this step! Feel free to take special liberties here. It's probably wise to pick out a few titles that you know should be returned manually, then check to see if the script is working as desired. 

In [None]:
#combine dfs 
columbia = pd.concat([df, exceptions])
columbia

Lastly, we want to export our csv. Ideally, all csv files should be written to the courses directory in our repository. 

In [None]:
#export as csv
columbia.to_csv('../courses/columbia.csv')