## Example course catalog crawler 

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.request #handles urls
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

The first thing we want to do is set up a function for standard preprocessing. It's also useful to list all of the URLs we'll need to send requests to before scraping. We want all courses within two *academic* years (fall 2017 through summer 2019), as opposed to two calendar years.

In [2]:
#keyword preprocessing
def preprocess(keyword):
    tokens = word_tokenize(keyword.lower()) #lowercase, then tokenize
    stem = ''
    for token in tokens:
        #each pass overwrites the last, so a multi-word keyword keeps only its final word's stem
        stem = stemmer.stem(token)
    return stem

#course catalog URLs - 2 academic years 
urls = ['https://classes.cornell.edu/search/roster/FA17?q=',
        'https://classes.cornell.edu/search/roster/SP18?q=',
        'https://classes.cornell.edu/search/roster/SU18?q=',
        'https://classes.cornell.edu/search/roster/FA18?q=',
        'https://classes.cornell.edu/search/roster/SP19?q=',
        'https://classes.cornell.edu/search/roster/SU19?q=']
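
The list above could also be generated rather than typed out. Here is a minimal sketch, assuming the roster codes always follow the `FA`/`SP`/`SU` plus two-digit-year pattern seen in the URLs. `ROSTER_SEARCH` and `roster_urls` are hypothetical names, not part of the original script.

```python
# Hypothetical helper: rebuilds the roster search URLs for consecutive academic years.
# Assumes term codes are 'FA', 'SP', or 'SU' followed by a two-digit year.
ROSTER_SEARCH = 'https://classes.cornell.edu/search/roster/{term}?q='

def roster_urls(first_fall_year, n_years):
    """Return search URLs for n_years academic years, starting with the given fall."""
    urls = []
    for year in range(first_fall_year, first_fall_year + n_years):
        fall = 'FA{:02d}'.format(year % 100)
        spring = 'SP{:02d}'.format((year + 1) % 100)
        summer = 'SU{:02d}'.format((year + 1) % 100)
        for term in (fall, spring, summer):
            urls.append(ROSTER_SEARCH.format(term=term))
    return urls

urls = roster_urls(2017, 2)
```

Calling `roster_urls(2017, 2)` reproduces the six URLs listed above, which makes it easier to shift the window to a different pair of academic years.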

Next, we'll want to import our keyword csv, split it into keyword lists, and preprocess them. The way the csv is set up, we keep the words marked as technical (`T`) or normative (`N`) that we've also chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others, so we manually alter the words that are not usefully preprocessed. For the normative list, that mostly means trimming stems that end in `i` (for example, `privaci` becomes `privac`). For the technical list, it means prefixing short acronyms such as `ai` and `cs` with `^`.

In [3]:
#import keywords
keywords = pd.read_csv("../keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']
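
The chain of normative replacements can also be expressed as a single lookup table. This is only a sketch: `STEM_FIXES` and `apply_fixes` are hypothetical names, and unlike `str.replace`, which rewrites any matching substring, the lookup only swaps a stem that matches an entry exactly.

```python
# Hypothetical lookup table mirroring the normative replacements above.
STEM_FIXES = {
    'privaci': 'privac', 'democraci': 'democra', 'equiti': 'equit',
    'histori': 'histor', 'justice': 'justic', 'liberti': 'libert',
    'philosophi': 'philosoph', 'societi': 'societ', 'polici': 'polic',
}

def apply_fixes(stems, fixes=STEM_FIXES):
    """Swap each stem for its manual fix; stems without a fix pass through unchanged."""
    return [fixes.get(stem, stem) for stem in stems]

fixed = apply_fixes(['privaci', 'ethic', 'polici'])
```

Keeping the fixes in one dictionary means a new exception is a one-line change, at the cost of no longer catching the stem inside a longer string.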


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want to search within these courses to see whether each one also contains a technical keyword.
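
The two steps can be sketched on plain dictionaries. This is an illustration under assumptions, not the notebook's implementation: in the notebook, step 1 is carried out by the roster search itself, whereas here both steps run in memory. `contains_any` and `two_step_filter` are hypothetical names, and the descriptions are placeholder text.

```python
def contains_any(text, stems):
    """Return the first stem found in the lowercased text, or '' if none match."""
    lowered = text.lower()
    for stem in stems:
        if stem in lowered:
            return stem
    return ''

def two_step_filter(courses, normative, technical):
    """Step 1: keep courses mentioning a normative stem.
    Step 2: of those, keep courses whose title contains a technical stem."""
    kept = []
    for course in courses:
        text = course['title'] + ' ' + course['description']
        if contains_any(text, normative) and contains_any(course['title'], technical):
            kept.append(course)
    return kept

# Titles come from the roster output; descriptions are placeholders.
sample = [
    {'title': 'Privacy in the Digital Age', 'description': 'placeholder'},
    {'title': 'Financial Accounting', 'description': 'placeholder'},
]
kept = two_step_filter(sample, ['privac', 'account'], ['digit', 'data'])
```

Both sample courses pass step 1, but only the first has a technical stem (`digit`) in its title, so it is the only one kept.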

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`
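
One way to keep these names consistent across university scripts is to define them once and build every record from that definition. The sketch below is an assumption about how that could look. `COLUMNS` and `make_row` are hypothetical names, and filling missing fields with an empty string is a choice made here for illustration.

```python
# Shared column names, in the order listed above.
COLUMNS = ('title', 'dept_num', 'description', 'credits', 'instructor',
           'syllabus', 'university', 'term', 'keyword')

def make_row(**fields):
    """Build one course record with every standard column, rejecting unknown names."""
    unknown = set(fields) - set(COLUMNS)
    if unknown:
        raise KeyError('unexpected column(s): ' + ', '.join(sorted(unknown)))
    # Missing fields default to '' so every record has the same shape.
    return {column: fields.get(column, '') for column in COLUMNS}

row = make_row(title='Privacy in the Digital Age', dept_num='CS 5436',
               university='cornell university', term='Spring 2018',
               keyword='privac')
```

A misspelled column name raises an error immediately instead of silently creating an extra column that breaks the merge later.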

In [18]:
#init dfs
cornell = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword'])

The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

In [19]:
#roster search for all urls
for url in urls:
    #loop through all normative words and extract relevant elements 
    for word in normative: 
        url_keyword = url + word #NOTE:this structure will likely be different between rosters!
        response = requests.get(url_keyword)
        soup = BeautifulSoup(response.content, 'lxml')
        #extract relevant elements 
        titles = [p.get_text() for p in soup.select(".title-coursedescr")] #get full desc
        dept_nums = [p.get_text() for p in soup.select('.title-subjectcode')]
        descs = [p.get_text() for p in soup.select('.course-descr')]
        credit = [p.get_text() for p in soup.select('.credit-val')]
        profs = [p.get_text() for p in soup.select('.instructors span.tooltip-iws')]
        syllabi = [p.get_text() for p in soup.select('.enrlgrp-syllabi')] #TODO: extract syllabi link
        syllabi = [p.replace('Syllabi:\n','') for p in syllabi] 
        uni = ['cornell university']
        term = [btn.get_text().strip() for btn in soup.select('button',{'class':'btn btn-default dropdown-toggle'})[:1]]       
        keyword = [word]
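        #zip stops at its shortest input; uni, term, and keyword each hold one item, so each query adds at most one row
        rows = [dict(zip(cornell.columns, items))
                for items in zip(titles,dept_nums,descs,credit,profs,syllabi,uni,term,keyword)]
        #DataFrame.append was removed in pandas 2.0, so concatenate the new rows instead
        cornell = pd.concat([cornell, pd.DataFrame(rows, columns=cornell.columns)], ignore_index=True)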

cornell

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
0,Financial Accounting,AEM 2210,Comprehensive introduction to financial accoun...,3,"Sinclair, J",1 available,cornell university,Fall 2017,account
1,Telling to Live: Critical Examinations of Test...,AMST 3680,Testimonio is a type of writing known in Latin...,4,"Diaz, E",none,cornell university,Fall 2017,critic
2,"Populism, Democracy & Authoritarianism",GOVT 3284,"Populist leaders, movements, and parties who c...",4,"Roberts, K",none,cornell university,Fall 2017,democra
3,Employment Discrimination and the Law,ILRLR 4842,Examines the laws against employment discrimin...,4,"Lieberwitz, R",none,cornell university,Fall 2017,discrimin
4,Controversies About Inequality,AMST 2225,"In recent years, poverty and inequality have b...",4,"Haskins, A",1 available,cornell university,Fall 2017,equal
5,Structural Barriers to Equity in Planning I,CRP 3106,This seminar will take a critical look at stru...,1,"Edmonds, K",1 available,cornell university,Fall 2017,equit
6,Engineering Ethics and Professional Practice,BEE 5400,An in-depth treatment of the ethical issues fa...,3,"Evans, R",1 available,cornell university,Fall 2017,ethic
7,National Security Affairs / Preparation for Ac...,AIRS 4401,This course is concerned with the national sec...,3,"Heath, M",none,cornell university,Fall 2017,fair
8,Topics in Feminist Media Arts,ARTH 4153,Feminist media arts continuously proliferate. ...,4,"Fernandez, M",none,cornell university,Fall 2017,femin
9,"Race, Gender, and Crossing Water: Narratives o...",AMST 6650,This course explores movement through and acro...,4,"Samuels, S",none,cornell university,Fall 2017,gender
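
Each keyword contributes a single row per term in the output above. That follows from how `zip` works: it stops at its shortest input, and `uni`, `term`, and `keyword` are one-item lists. Whether one row per query is intended isn't stated. If every course on a results page is wanted, the constant values could be repeated instead, as in this sketch. The lists are made-up stand-ins, and the approach assumes the per-course lists line up one-to-one (for example, exactly one instructor entry per course).

```python
from itertools import repeat

# Made-up stand-ins for the lists scraped from one results page.
titles = ['Course A', 'Course B', 'Course C']
dept_nums = ['AAA 1000', 'BBB 2000', 'CCC 3000']
uni = ['cornell university']  # single-item list, as in the loop above

# zip stops at its shortest input, so the one-item list limits the result to one row.
truncated = list(zip(titles, dept_nums, uni))

# Repeating the constant lets the per-course lists determine the number of rows.
full = list(zip(titles, dept_nums, repeat(uni[0])))
```

Here `truncated` holds one tuple and `full` holds three, one per course.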


Now that we've extracted the courses returned for each normative keyword, we need to keep only those whose titles also contain a technical keyword. The exceptions are courses returned for our preprocessed `privac` and `secur`, which we keep even if the title contains no technical keyword. To do this, we first look at the exception rows on their own, then flag the technical keyword found in each title, and finally apply both conditions in a single filter.
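
The keep/drop rule can be written as a small predicate. This is a sketch with hypothetical names (`ALWAYS_KEEP`, `keep_course`). It assumes each course carries the normative keyword that returned it and the technical keyword found in its title, with an empty string when none was found.

```python
# Normative stems whose courses are kept even without a technical match.
ALWAYS_KEEP = {'privac', 'secur'}

def keep_course(keyword, technical_match):
    """Keep a course if its title matched a technical stem or its keyword is an exception."""
    return technical_match != '' or keyword in ALWAYS_KEEP

# (normative keyword, technical match) pairs taken from the roster output.
rows = [('privac', ''), ('societ', 'inform'), ('account', '')]
kept = [row for row in rows if keep_course(*row)]
```

The first pair survives as an exception, the second because of its technical match, and the third is dropped.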

In [6]:
exceptions = cornell.loc[(cornell['keyword']=='privac') | (cornell['keyword'] =='secur')]
exceptions

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
20,Security and Privacy Concepts in the Wild,CS 5435,This course will impart a technical and social...,3,"Juels, A",1 available,cornell university,Fall 2017,privac
25,Fixed-Income Securities,AEM 4260,Focuses on fixed-income securities including c...,4,"Bogan, V",none,cornell university,Fall 2017,secur
49,Privacy in the Digital Age,CS 5436,This course introduces students to privacy tec...,3-4,"Nissenbaum, H",none,cornell university,Spring 2018,privac
54,Practitioner's Overview of Securities Markets ...,AEM 3060,A broad overview of various aspects of the Fix...,1,"Edwards, A",1 available,cornell university,Spring 2018,secur
93,Security and Privacy Concepts in the Wild,CS 5435,This course will impart a technical and social...,3,"Juels, A",1 available,cornell university,Fall 2018,privac
98,National Security Affairs / Preparation for Ac...,AIRS 4401,This course is designed for college seniors an...,3,"Heath, M",none,cornell university,Fall 2018,secur
123,"Internet Law, Privacy and Security",LAW 6568,"This is a survey course in Internet law, with ...",3,"Grimmelmann, J",1 available,cornell university,Spring 2019,privac
128,Practitioner's Overview of Securities Markets ...,AEM 3060,A broad overview of various aspects of the Fix...,1,"Edwards, A",1 available,cornell university,Spring 2019,secur


In [20]:
print(cornell.title)

0                                   Financial Accounting
1      Telling to Live: Critical Examinations of Test...
2                 Populism, Democracy & Authoritarianism
3                  Employment Discrimination and the Law
4                         Controversies About Inequality
5            Structural Barriers to Equity in Planning I
6           Engineering Ethics and Professional Practice
7      National Security Affairs / Preparation for Ac...
8                          Topics in Feminist Media Arts
9      Race, Gender, and Crossing Water: Narratives o...
10      Introduction to American Government and Politics
11                             History Goes to Hollywood
12                        Controversies About Inequality
13     FWS: The Color and Class of Water: Environment...
14                                    Psychology and Law
15               Advanced Legal Research in Business Law
16                             Moral Dilemmas in the Law
17     Thinking from a Differen

In [24]:
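#loop through technical keyword list, extract relevant titles
new = []
for row in cornell.title:
    flag = False
    for word in technical:
        #NOTE: plain substring test, so the '^' in keywords like '^ai' is matched literally
        if word in row.lower():
            flag = True
            new.append(word)
            break #stop at the first match so each title adds exactly one entry
    if not flag:
        new.append('')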
print(new)
print(len(new))
print(len(cornell))

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'inform', '', 'system', '', '', '', '', '', '', '', '', 'technolog', '', '', '', '', '', '', '', '', '', '', 'digit', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'system', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'system', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'internet', '', '', '', '', '', '', '', '', '', '', '', 'inform', '', '', '', '', 'system', '', '', '', '', '', '', '']
148
148
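
The membership test `word in row.lower()` compares literal text, so a keyword such as `^ai` only matches a title that actually contains a caret. If the `^` prefix is meant as a regular-expression anchor (an assumption; the notebook doesn't say), `re.search` would honor it. The sketch below uses a hypothetical helper, `first_technical_match`, and two made-up titles alongside one from the roster.

```python
import re

def first_technical_match(title, technical):
    """Return the first pattern that matches the lowercased title, or '' if none do."""
    lowered = title.lower()
    for pattern in technical:
        if re.search(pattern, lowered):
            return pattern
    return ''

# 'AI and Society' and 'Fair Trade' are made-up titles for illustration.
sample_titles = ['AI and Society', 'Privacy in the Digital Age', 'Fair Trade']
matches = [first_technical_match(t, ['^ai', 'digit', 'system'])
           for t in sample_titles]
```

With the anchor honored, `Fair Trade` no longer risks matching on the `ai` inside "fair". Note that `^` only matches at the very start of a title, so a title like "Introduction to AI" would still be missed; a word-boundary pattern such as `\bai\b` may be closer to the intent.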


In [28]:
cornell['new'] = new
cornell

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,new
0,Financial Accounting,AEM 2210,Comprehensive introduction to financial accoun...,3,"Sinclair, J",1 available,cornell university,Fall 2017,account,
1,Telling to Live: Critical Examinations of Test...,AMST 3680,Testimonio is a type of writing known in Latin...,4,"Diaz, E",none,cornell university,Fall 2017,critic,
2,"Populism, Democracy & Authoritarianism",GOVT 3284,"Populist leaders, movements, and parties who c...",4,"Roberts, K",none,cornell university,Fall 2017,democra,
3,Employment Discrimination and the Law,ILRLR 4842,Examines the laws against employment discrimin...,4,"Lieberwitz, R",none,cornell university,Fall 2017,discrimin,
4,Controversies About Inequality,AMST 2225,"In recent years, poverty and inequality have b...",4,"Haskins, A",1 available,cornell university,Fall 2017,equal,
5,Structural Barriers to Equity in Planning I,CRP 3106,This seminar will take a critical look at stru...,1,"Edmonds, K",1 available,cornell university,Fall 2017,equit,
6,Engineering Ethics and Professional Practice,BEE 5400,An in-depth treatment of the ethical issues fa...,3,"Evans, R",1 available,cornell university,Fall 2017,ethic,
7,National Security Affairs / Preparation for Ac...,AIRS 4401,This course is concerned with the national sec...,3,"Heath, M",none,cornell university,Fall 2017,fair,
8,Topics in Feminist Media Arts,ARTH 4153,Feminist media arts continuously proliferate. ...,4,"Fernandez, M",none,cornell university,Fall 2017,femin,
9,"Race, Gender, and Crossing Water: Narratives o...",AMST 6650,This course explores movement through and acro...,4,"Samuels, S",none,cornell university,Fall 2017,gender,


In [30]:
cornellNew = cornell[(cornell['new']!='') | (cornell['keyword']=='privac') | (cornell['keyword'] =='secur')]
cornellNew

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,new
20,Security and Privacy Concepts in the Wild,CS 5435,This course will impart a technical and social...,3,"Juels, A",1 available,cornell university,Fall 2017,privac,
25,Fixed-Income Securities,AEM 4260,Focuses on fixed-income securities including c...,4,"Bogan, V",none,cornell university,Fall 2017,secur,
27,Inventing an Information Society,AMST 2980,Explores the history of information technology...,3,"Kline, R",1 available,cornell university,Fall 2017,societ,inform
29,Toward a Sustainable Global Food System: Food ...,AEM 4450,Comprehensive presentation and discussion of p...,3,"Pingali, P",1 available,cornell university,Fall 2017,polic,system
38,"Gendering Religion, Science and Technology",AMST 2621,"There are several ""just-so stories"" about scie...",4,"Rock-Singer, C",none,cornell university,Spring 2018,gender,technolog
49,Privacy in the Digital Age,CS 5436,This course introduces students to privacy tec...,3-4,"Nissenbaum, H",none,cornell university,Spring 2018,privac,digit
54,Practitioner's Overview of Securities Markets ...,AEM 3060,A broad overview of various aspects of the Fix...,1,"Edwards, A",1 available,cornell university,Spring 2018,secur,
66,The American Legal System,GOVT 3150,This course offers a comprehensive introductio...,4,"Stewart, C",none,cornell university,Summer 2018,legal,system
93,Security and Privacy Concepts in the Wild,CS 5435,This course will impart a technical and social...,3,"Juels, A",1 available,cornell university,Fall 2018,privac,
98,National Security Affairs / Preparation for Ac...,AIRS 4401,This course is designed for college seniors an...,3,"Heath, M",none,cornell university,Fall 2018,secur,


Lastly, we want to export our filtered data frame as a csv. Ideally, all csv files should be written to the courses directory in our repository.

In [31]:
#export as csv
cornellNew.to_csv('../courses/cornell.csv')
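
By default, `to_csv` writes the row index as a first, unnamed column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. If the index isn't needed in the compiled dataset, passing `index=False` drops it. The sketch below writes to in-memory buffers rather than the repository so the two headers can be compared. The two-column frame is a cut-down stand-in for `cornellNew`.

```python
import io
import pandas as pd

# Two rows with a non-default index, as left behind by filtering.
courses = pd.DataFrame({'title': ['Security and Privacy Concepts in the Wild',
                                  'Privacy in the Digital Age'],
                        'keyword': ['privac', 'privac']}, index=[20, 49])

with_index = io.StringIO()
courses.to_csv(with_index)                  # default: index written as an unnamed first column
without_index = io.StringIO()
courses.to_csv(without_index, index=False)  # index omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
```

The first header begins with an empty field for the index, while the second starts directly with `title`.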