## Rice University

This script serves as a basic tutorial for extracting courses of interest from a university. This is by no means the only (or even best way) to go about this process—so if you come up with a process that works better, feel free to implement! If you're unfamiliar with any of the libraries, the comments below annotate reasoning behind each.

In [122]:
import sys
import pandas as pd
import numpy as np
import time
import re
import urllib.request #handles urls
from urllib.request import urlopen
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

The first thing we want to do is set up a function for standard preprocessing. It's also useful to list all of the URLs we'll need to send requests to before scraping. We want all courses within a 2 year *academic* calendar (as opposed to an annual calendar). 

In [123]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    keyword = word_tokenize(keyword) #tokenize
    for word in keyword:
        keyword = stemmer.stem(word) #stem 
    return (keyword)

Next, we'll want to import our keyword csv, split our keyword lists, and preprocess them. The way the csv is set up, we'll want to split the words that are indicated as technical (`T`) or normative (`N`) and that we've chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others. Here, we've chosen to manually alter words that are not usefully preprocessed. In this case, it means replacing instances of words that are stemmed to end in i.

[regex is a bitch here]

In [124]:
#import keywords
keywords = pd.read_csv("../keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want search within these courses to see if it also contains a technical keyword.

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`

In [125]:
rice_list = []

from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome()

#course catalog URLs - 2 academic years 
terms = ['Fall Semester 2017', 
         'Spring Semester 2018', 
         'Summer Semester 2018', 
         'Fall Semester 2018', 
         'Spring Semester 2019', 
         'Summer Semester 2019',
         'Summer Quadmester 2018',
         'Fall Quadmester 2018',
         'Winter Quadmester 2019',
         'Spring Quadmester 2019']

for term in terms:
    print('--------------------')
    print('--------------------')
    print(term)
    
    for word in normative:
        driver.get("https://courses.rice.edu/courses/swkscat.main")
        time.sleep(2)
        
        select = Select(driver.find_element_by_xpath('//*[@id="p_term"]'))
        select.select_by_visible_text(term)

#         # for word in normative:
#         subject_field = driver.find_element_by_xpath('//*[@id="subj_id"]') 
#         subject_select = Select(subject_field)

#         #select all subjects
#         for subject in subject_field.find_elements_by_tag_name('option'):
#             subject_select.select_by_visible_text(subject.text)

        text_input = driver.find_element_by_xpath('//*[@id="p_onebar"]')
        
        print('--------------------')
        print(word)
        
        text_input.send_keys(word)
        get_course = driver.find_element_by_xpath('//*[@id="p_submit"]')
        get_course.click()
        time.sleep(2)
        
        all_courses = driver.find_elements_by_xpath('//*[@id="searchPage"]/div/div[5]/div/table/tbody')

        dept_nums = driver.find_elements_by_class_name('cls-crs')[1:]
        titles = driver.find_elements_by_class_name('cls-ttl')[1:]
        instructors = driver.find_elements_by_class_name('cls-ins')[1:]
        all_credits = driver.find_elements_by_class_name('cls-crd')[1:]
        sessions = driver.find_elements_by_class_name('cls-ses')[1:]

        for dept_num, title, instructor, credits, session in zip(dept_nums, titles, instructors, all_credits, sessions):
            rice_dict = {}
            rice_dict['dept_num'] = dept_num.text
            rice_dict['credits'] = credits.text
            rice_dict['instructor'] = instructor.text
            rice_dict['title'] = title.text
            rice_dict['session'] = session.text
#             rice_dict['description'] = description.text
            rice_dict['term'] = term
            rice_dict['keyword'] = word
            rice_dict['university'] = 'rice university'
            
            #keyword search, so make sure words are also present in titles
            if word.upper() in title.text:
                rice_list.append(rice_dict)
            
driver.close()

--------------------
--------------------
Fall Semester 2017
--------------------
account
--------------------
critic
--------------------
democra
--------------------
discrimin
--------------------
equal
--------------------
equit
--------------------
ethic
--------------------
fair
--------------------
femin
--------------------
gender
--------------------
govern
--------------------
histor
--------------------
inequ
--------------------
justic
--------------------
law
--------------------
legal
--------------------
libert
--------------------
moral
--------------------
norm
--------------------
philosoph
--------------------
polit
--------------------
power
--------------------
privac
--------------------
race
--------------------
religi
--------------------
respons
--------------------
right
--------------------
secur
--------------------
social
--------------------
societ
--------------------
surveil
--------------------
transpar
--------------------
valu
--------------------
poli

--------------------
equal
--------------------
equit
--------------------
ethic
--------------------
fair
--------------------
femin
--------------------
gender
--------------------
govern
--------------------
histor
--------------------
inequ
--------------------
justic
--------------------
law
--------------------
legal
--------------------
libert
--------------------
moral
--------------------
norm
--------------------
philosoph
--------------------
polit
--------------------
power
--------------------
privac
--------------------
race
--------------------
religi
--------------------
respons
--------------------
right
--------------------
secur
--------------------
social
--------------------
societ
--------------------
surveil
--------------------
transpar
--------------------
valu
--------------------
polic
--------------------
--------------------
Spring Quadmester 2019
--------------------
account
--------------------
critic
--------------------
democra
--------------------
disc

In [126]:
rice = pd.DataFrame(rice_list)
rice

Unnamed: 0,credits,dept_num,instructor,keyword,session,term,title,university
0,3,BUSI 305 003,"Naranjo Olivares, Patricia L.",account,Full Term,Fall Semester 2017,FINANCIAL ACCOUNTING,rice university
1,3,BUSI 305 004,"Naranjo Olivares, Patricia L.",account,Full Term,Fall Semester 2017,FINANCIAL ACCOUNTING,rice university
2,3,MACC 501 001,"Windsor, Duane\nButler, Lee Ann E.",account,MBA Full Term Fall,Fall Semester 2017,ETHICS IN ACCOUNTING,rice university
3,1.5,MACC 514 001,"Ramesh, Krishnamoorthy",account,MBA ILE 1,Fall Semester 2017,FAIR VALUE ACCOUNTING,rice university
4,1.5,MACC 581 001,"Fralic, Bradley W.",account,MBA Term II,Fall Semester 2017,GOVT AND NFP ACCOUNTING,rice university
5,3,MGMP 501 001,"Lansford, Benjamin N.",account,PMBA First Year 2,Fall Semester 2017,FINANCIAL ACCOUNTING,rice university
6,3,MGMP 501 002,"Lansford, Benjamin N.",account,PMBA First Year 2,Fall Semester 2017,FINANCIAL ACCOUNTING,rice university
7,1.5,MGMP 602 001,"Wang, Sol S.",account,MBA Term II,Fall Semester 2017,ACCOUNTING-BASED VALUATION,rice university
8,1.5,MGMP 602 002,"Wang, Sol S.",account,EMBA Term IV,Fall Semester 2017,ACCOUNTING-BASED VALUATION,rice university
9,3,MGMT 501 001,"Akins, Brian K.",account,MBA Full Term Fall,Fall Semester 2017,FINANCIAL ACCOUNTING,rice university


The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

Now that we've extracted all courses containing a normative keyword of interest, we need to filter our courses to only return titles that contain a normative AND a technical keyword. This is the case for all words except instances of our preprocessed `privac` and `secur`, for which we want to return all courses, even if they don't contain two keywords. To do this, we'll split the courses into two data frames, apply our respective conditions, and then merge them back together. 

In [127]:
exceptions = rice.loc[(rice['keyword']=='privac') | (rice['keyword'] =='secur')]
exceptions

Unnamed: 0,credits,dept_num,instructor,keyword,session,term,title,university
181,3.0,AFSC 401 001,,secur,Full Term,Fall Semester 2017,NATIONAL SECURITY AFFAIRS I,rice university
182,3.0,GLBL 552 001,"Birenbaum, Cory S.",secur,Full Term,Fall Semester 2017,INTERNATIONAL SECURITY,rice university
183,1.5,MGMT 674 001,"Duarte, Jefferson",secur,MBA Term I,Fall Semester 2017,REAL ESTATE FINANCE:SECURITIES,rice university
386,3.0,AFSC 402 001,,secur,Full Term,Spring Semester 2018,NATIONAL SECURITY AFFAIRS II,rice university
387,3.0,COMP 427 001,"Wallach, Dan S.",secur,Full Term,Spring Semester 2018,INTRO TO COMPUTER SECURITY,rice university
388,3.0,COMP 541 001,"Wallach, Dan S.",secur,Full Term,Spring Semester 2018,INTRO TO COMPUTER SECURITY,rice university
389,3.0,GLBL 525 001,"Ard, Michael J.",secur,Full Term,Spring Semester 2018,INTERNATIONAL SECURITY,rice university
620,3.0,AFSC 401 001,,secur,Full Term,Fall Semester 2018,NATIONAL SECURITY AFFAIRS I,rice university
621,3.0,COMP 436 001,"Chen, Ang",secur,Full Term,Fall Semester 2018,SECURE & CLOUD COMPUTING,rice university
622,3.0,COMP 536 001,"Chen, Ang",secur,Full Term,Fall Semester 2018,SECURE & CLOUD COMPUTING,rice university


In [128]:
#loop through technical keyword list, extract relevant titles
for word in technical:
    df = rice[rice['title'].str.contains(word, flags = re.IGNORECASE)]
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,credits,dept_num,instructor,keyword,session,term,title,university


NOTE: the above cell is likely not the best nor most simple way to execute this step! Feel free to take special liberties here. It's probably wise to pick out a few titles that you know should be returned manually, then check to see if the script is working as desired. 

In [129]:
#combine dfs 
rice = pd.concat([df, exceptions])
rice

Unnamed: 0,credits,dept_num,instructor,keyword,session,term,title,university
181,3.0,AFSC 401 001,,secur,Full Term,Fall Semester 2017,NATIONAL SECURITY AFFAIRS I,rice university
182,3.0,GLBL 552 001,"Birenbaum, Cory S.",secur,Full Term,Fall Semester 2017,INTERNATIONAL SECURITY,rice university
183,1.5,MGMT 674 001,"Duarte, Jefferson",secur,MBA Term I,Fall Semester 2017,REAL ESTATE FINANCE:SECURITIES,rice university
386,3.0,AFSC 402 001,,secur,Full Term,Spring Semester 2018,NATIONAL SECURITY AFFAIRS II,rice university
387,3.0,COMP 427 001,"Wallach, Dan S.",secur,Full Term,Spring Semester 2018,INTRO TO COMPUTER SECURITY,rice university
388,3.0,COMP 541 001,"Wallach, Dan S.",secur,Full Term,Spring Semester 2018,INTRO TO COMPUTER SECURITY,rice university
389,3.0,GLBL 525 001,"Ard, Michael J.",secur,Full Term,Spring Semester 2018,INTERNATIONAL SECURITY,rice university
620,3.0,AFSC 401 001,,secur,Full Term,Fall Semester 2018,NATIONAL SECURITY AFFAIRS I,rice university
621,3.0,COMP 436 001,"Chen, Ang",secur,Full Term,Fall Semester 2018,SECURE & CLOUD COMPUTING,rice university
622,3.0,COMP 536 001,"Chen, Ang",secur,Full Term,Fall Semester 2018,SECURE & CLOUD COMPUTING,rice university


In [130]:
rice = rice[['title', 'dept_num', 'credits', 'instructor', 'university', 'term', 'keyword', 'session']]
rice

Unnamed: 0,title,dept_num,credits,instructor,university,term,keyword,session
181,NATIONAL SECURITY AFFAIRS I,AFSC 401 001,3.0,,rice university,Fall Semester 2017,secur,Full Term
182,INTERNATIONAL SECURITY,GLBL 552 001,3.0,"Birenbaum, Cory S.",rice university,Fall Semester 2017,secur,Full Term
183,REAL ESTATE FINANCE:SECURITIES,MGMT 674 001,1.5,"Duarte, Jefferson",rice university,Fall Semester 2017,secur,MBA Term I
386,NATIONAL SECURITY AFFAIRS II,AFSC 402 001,3.0,,rice university,Spring Semester 2018,secur,Full Term
387,INTRO TO COMPUTER SECURITY,COMP 427 001,3.0,"Wallach, Dan S.",rice university,Spring Semester 2018,secur,Full Term
388,INTRO TO COMPUTER SECURITY,COMP 541 001,3.0,"Wallach, Dan S.",rice university,Spring Semester 2018,secur,Full Term
389,INTERNATIONAL SECURITY,GLBL 525 001,3.0,"Ard, Michael J.",rice university,Spring Semester 2018,secur,Full Term
620,NATIONAL SECURITY AFFAIRS I,AFSC 401 001,3.0,,rice university,Fall Semester 2018,secur,Full Term
621,SECURE & CLOUD COMPUTING,COMP 436 001,3.0,"Chen, Ang",rice university,Fall Semester 2018,secur,Full Term
622,SECURE & CLOUD COMPUTING,COMP 536 001,3.0,"Chen, Ang",rice university,Fall Semester 2018,secur,Full Term


Lastly, we want to export our csv. Ideally, all csv files should be written to the courses directory in our repository. 

In [131]:
#export as csv
rice.to_csv('../courses/22-Rice-University.csv', index=False)