## Oregon State University Crawler

Imports.

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.request #handles urls
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

Keyword preprocessing, plus the URL list for the relevant catalog years (2019-20 and 2018-19) and the list of departments.

In [2]:
#keyword preprocessing
def preprocess(keyword):
    tokens = word_tokenize(keyword.lower()) #lowercase and tokenize
    #stem each token; note that for a multi-word keyword only the
    #stem of the last token is kept
    for word in tokens:
        keyword = stemmer.stem(word)
    return keyword

#course catalog URLs - 2 academic years 
urls = ['https://catalog.oregonstate.edu/courses/',
        'https://catalog.oregonstate.edu/archives/2018-2019/courses/']

#list of all the departments to search through 
departments = ['als/','actg/','ahe/','aae/','as/','aed/',
              'agri/','ag/','asl/','ams/','ans/','anth/',
              'aec/','aj/','arab/','art/','asn/','ats/',
              'bb/','bhs/','bee/','bds/','bioe/','bi/',
              'brr/','bot/','ba/','cbee/','che/','ch/',
              'chn/','cce/','ce/','cssa/','comm/','cs/',
              'cem/','coun/','css/','crop/','dsgn/','dhe/',
              'econ/','ed/','ece/','ese/','engr/','eng/',
              'ent/','eah/','enve/','ensc/','es/','film/',
              'fin/','fw/','fcsj/','fst/','fes/','fe/',
              'for/','fr/','gs/','geog/','gph/','geo/',
              'ger/','grad/','gd/','hhs/','herb/','hst/',
              'hsts/','hc/','hort/','hm/','hdfs/','hest/',
              'ie/','ib/','iepa/','iepg/','ieph/','ist/',
              'intl/','it/','jpn/','kin/','kor/','lead/',
              'la/','ls/','lib/','ling/','mgmt/','mfge/',
              'mrm/','mrkt/','mnr/','mpp/','mats/','mth/',
              'mime/','me/','mb/','ms/','mcb/','mus/','mup/',
              'mued/','nr/','ns/','nmc/','nse/','nur/',
              'nutr/','oeas/','oc/','op/','pax/','phar/',
              'phl/','pac/','pt/','ph/','pbg/','ps/','psm/',
              'psy/','h/','ppol/','qs/','rng/','rel/','rob/',
              'rs/','rus/','sed/','ssci/','soc/','se/','soil/',
              'span/','st/','sus/','snr/','ta/','tral/','tox/',
              'tcs/','uexp/','vmb/','vmc/','wre/','wrp/',
              'wrs/','wgss/','wse/','wlc/','wr/','z/']
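
A department slug that is missing its trailing slash is easy to overlook in a list this long. The following is a minimal sketch, not part of the original notebook, of normalizing the slugs before the page links are built. The helper name `build_page_links` and the two sample slugs are assumptions made for illustration.

```python
from urllib.parse import urljoin

def build_page_links(base_urls, dept_slugs):
    """Return one page link per (catalog year, department) pair."""
    links = []
    for base in base_urls:
        for slug in dept_slugs:
            # guarantee exactly one trailing slash on every slug
            slug = slug.strip('/') + '/'
            links.append(urljoin(base, slug))
    return links

sample_urls = ['https://catalog.oregonstate.edu/courses/']
sample_depts = ['cs/', 'wr']  # second slug deliberately lacks its slash
page_links = build_page_links(sample_urls, sample_depts)
print(page_links)
```

Both sample slugs end up in the same `.../courses/<dept>/` form, so every request goes to a consistently shaped URL.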

Next, we import our keyword csv, split the keyword lists, and preprocess them. Given how the csv is set up, we keep the words marked as technical (`T`) or normative (`N`) that we've chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others, so we manually alter the stems that are not usefully preprocessed. In this case, that means trimming stems that end in `i` (for example, `privaci` becomes `privac`), since the untrimmed stem would not match a title containing "privacy".

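Note: the short abbreviations (`ai`, `cs`, `ict`, `ml`, `nlp`) are prefixed with the regex anchor `^`, so they only match at the very start of a title. Without some restriction, a pattern like `ai` would also match inside unrelated words. The regex handling here is fiddly and worth double-checking.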
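
To see what the `^` prefix on the abbreviations does in practice, here is a small sketch using Python's `re` module. The three titles are made up for illustration, and the word-boundary pattern `\bai\b` is an alternative offered for comparison, not something the notebook uses.

```python
import re

# hypothetical course titles, for illustration only
sample_titles = ['AI AND SOCIETY', 'FAIRNESS IN MACHINE LEARNING', 'ETHICS OF AI']

def matching(pattern, titles):
    """Return the titles in which the regex pattern is found (case-insensitive)."""
    return [t for t in titles if re.search(pattern, t, flags=re.IGNORECASE)]

unanchored = matching('ai', sample_titles)        # also hits the "AI" inside "FAIRNESS"
anchored = matching('^ai', sample_titles)         # only titles that start with "ai"
word_bounded = matching(r'\bai\b', sample_titles) # "ai" as a whole word, anywhere

print(unanchored)
print(anchored)
print(word_bounded)
```

The anchored pattern avoids the false hit on "FAIRNESS" but also misses "ETHICS OF AI"; the word-boundary version keeps both genuine matches. Which trade-off is right depends on the keyword list.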

In [3]:
#import keywords
keywords = pd.read_csv("keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']
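
The reason for trimming the stems that end in `i` can be checked with plain substring tests, which is how the normative keywords are matched against titles later on. This is a minimal sketch; the sample title is hypothetical.

```python
# stems as produced by the stemmer, mapped to their manually trimmed forms
trimmed = {'privaci': 'privac', 'polici': 'polic', 'histori': 'histor'}

# hypothetical title, lowercased the same way the crawler lowercases titles
sample_title = 'Privacy, Policy, and the History of Computing'.lower()

# for each stem: (does the raw stem match?, does the trimmed stem match?)
hits = {stem: (stem in sample_title, short in sample_title)
        for stem, short in trimmed.items()}

for stem, (raw_hit, trimmed_hit) in hits.items():
    print(stem, raw_hit, trimmed_hit)
```

Each raw stem fails to match because the title spells the word with a `y`, while the trimmed stem matches.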


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want to search within these courses to see if they also contain a technical keyword.
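
The two steps can be sketched on a handful of titles with plain Python before involving the crawler. The keyword subsets and titles below are chosen for illustration, and `find_stems` is a hypothetical helper, not a function from the notebook.

```python
# small subsets of the keyword lists, for illustration
normative_sample = ['ethic', 'social', 'gender']
technical_sample = ['comput', 'data', 'technolog']

sample_titles = ['GENDER AND TECHNOLOGY', 'COMPUTER ETHICS',
                 'SOCIAL PSYCHOLOGY', 'DATA STRUCTURES']

def find_stems(title, stems):
    """Return every stem that occurs as a substring of the lowercased title."""
    lowered = title.lower()
    return [s for s in stems if s in lowered]

# step 1: keep titles with at least one normative keyword
step1 = {t: find_stems(t, normative_sample) for t in sample_titles}
step1 = {t: hits for t, hits in step1.items() if hits}

# step 2: of those, keep titles that also contain a technical keyword
step2 = {t: (hits, find_stems(t, technical_sample)) for t, hits in step1.items()}
step2 = {t: pair for t, pair in step2.items() if pair[1]}

print(list(step1))
print(step2)
```

"DATA STRUCTURES" never enters step 1 because it has no normative keyword, and "SOCIAL PSYCHOLOGY" is dropped in step 2 because it has no technical one.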

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`
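
One way to keep these feature names standardized across university scripts is to define them once and build every row through a small helper. This is only a sketch of that idea: `COLUMNS` and `make_record` are hypothetical names, and the `'Not Listed'` default mirrors the placeholder this notebook uses for missing instructors and syllabi.

```python
# shared column names, in the order used by the data frame
COLUMNS = ('title', 'dept_num', 'description', 'credits', 'instructor',
           'syllabus', 'university', 'term', 'keyword')

def make_record(**fields):
    """Build one course row, rejecting column names outside the shared schema."""
    unknown = set(fields) - set(COLUMNS)
    if unknown:
        raise KeyError(f'unexpected column(s): {sorted(unknown)}')
    # any column that was not supplied falls back to the placeholder
    return {col: fields.get(col, 'Not Listed') for col in COLUMNS}

record = make_record(title='GENDER AND TECHNOLOGY', dept_num='WGSS 320',
                     credits='3 Credits', university='Oregon State University',
                     term='2019-20', keyword='gender')
print(record)
```

A misspelled column such as `instructors` raises immediately instead of silently producing a column that will not line up when the per-university files are merged.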

In [4]:
#init dfs
oregon = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword'])
titles = []
dept_nums = []
descs = []
credit = []
profs = []
syllabi = []
uni = []
term = []     
keyword = []

The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

In [5]:
#looping through each year's catalog
for url in urls:
    #looping through every department page to process each course's information
    for dept in departments:
        page_link = url + dept
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, 'html.parser')
        courses = [p.get_text().split('\n') for p in soup.select(".courseblock")]
        for crs in courses:
            title = crs[1]
            for word in normative:
                if word in title.lower():
                    titles.append(title[title.find('.')+1:title.rfind('.')])
                    dept_nums.append(title[:title.find('.')])
                    descs.append(crs[2])
                    credit.append(title[title.find('(')+1:len(title)-1])
                    profs.append('Not Listed')
                    syllabi.append('Not Listed')
                    uni.append('Oregon State University')
                    if url == 'https://catalog.oregonstate.edu/courses/':
                        term.append('2019-20')
                    else:
                        term.append('2018-19')
                    keyword.append(word)
            
#build the data frame from the collected lists in a single step
#(DataFrame.append was removed in pandas 2.0)
oregon = pd.DataFrame({'title': titles,
                       'dept_num': dept_nums,
                       'description': descs,
                       'credits': credit,
                       'instructor': profs,
                       'syllabus': syllabi,
                       'university': uni,
                       'term': term,
                       'keyword': keyword})
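
The string slicing inside the loop is easier to follow on a single title line. The sketch below assumes a course block's title line has the form `DEPT NUM. TITLE. (N Credits)`, which is inferred from the slicing logic and the extracted values rather than confirmed against the catalog's HTML. `parse_title` is a hypothetical helper, and the `.strip()` calls are an addition.

```python
def parse_title(title_line):
    """Split one title line into department/number, course title, and credits."""
    first_dot = title_line.find('.')    # ends the department and number
    last_dot = title_line.rfind('.')    # ends the course title
    open_paren = title_line.find('(')   # starts the credits
    return {
        'dept_num': title_line[:first_dot].strip(),
        'title': title_line[first_dot + 1:last_dot].strip(),
        # drop the final character, assumed to be the closing parenthesis
        'credits': title_line[open_paren + 1:len(title_line) - 1].strip(),
    }

sample_line = 'CS 478. NETWORK SECURITY. (4 Credits)'  # assumed format
parsed = parse_title(sample_line)
print(parsed)
```

If a title line carries trailing whitespace or lacks parentheses, the slices shift, so it is worth spot-checking a few parsed rows against the catalog.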


Now that we've extracted all courses containing a normative keyword of interest, we need to filter our courses to only return titles that contain a normative AND a technical keyword. This is the case for all words except instances of our preprocessed `privac` and `secur`, for which we want to return all courses, even if they don't contain two keywords. To do this, we'll split the courses into two data frames, apply our respective conditions, and then merge them back together. 

In [6]:
exceptions = oregon.loc[(oregon['keyword']=='privac') | (oregon['keyword'] =='secur')]
exceptions

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
32,NATIONAL SECURITY AFFAIRS,AS 411,"Emphasis on the needs for national security, e...",3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
150,INFORMATION SYSTEMS SECURITY,BA 480,Course emphasis is on security risk mitigation...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
152,INFORMATION SECURITY GOVERNANCE,BA 482,"As a discipline cybersecurity covers software,...",4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
186,*COMMUNICATIONS SECURITY AND SOCIAL MOVEMENTS,CS 175,Equipping students with the theory and practic...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
188,INTRODUCTION TO SECURITY,CS 370,Introductory course on computer security with ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
193,NETWORK SECURITY,CS 478,Basic concepts and techniques in network secur...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
194,CYBER-SECURITY,CS 578,A broad overview of the field of computer and ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
254,NETWORK SECURITY,ECE 478,Basic concepts and techniques in network secur...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
258,DATA SECURITY AND CRYPTOGRAPHY,ECE 575,"Secret-key and public-key cryptography, authen...",3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur
259,CYBER-SECURITY,ECE 578,A broad overview of the field of computer and ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur


In [7]:
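#loop through technical keyword list, extract relevant titles
#note: df is reassigned on every pass, so only the matches for the last
#keyword in `technical` ('technolog') remain after the loop
for word in technical:
    #.copy() avoids pandas' SettingWithCopyWarning when adding the column
    df = oregon[oregon['title'].str.contains(word, flags = re.IGNORECASE)].copy()
    df['keyword2'] = word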
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df


Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
39,"*EVOLUTION OF PEOPLE, TECHNOLOGY, AND SOCIETY",ANTH 330,Overview of the evolution and prehistory of th...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog"
88,"*PHOTOGRAPHY: HISTORY, TECHNOLOGY, CULTURE A...",ART 264,Introduction to the history of photography thr...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"histor,technolog"
123,*ENERGY TECHNOLOGY AND SOCIAL CHANGE,BRR 325,Science and technology co-evolve with a prospe...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
156,BUSINESS LAW - TECHNOLOGY/NEW VENTURES,BA 531,An integrative course on managing legal and et...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"law,technolog"
260,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",ENGR 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
598,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",MATS 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
624,*SOCIETAL ASPECTS OF NUCLEAR TECHNOLOGY,NSE 319,Description and discussion of nuclear-related ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog"
1011,*SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 456,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
1029,SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 556,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
1068,*GENDER AND TECHNOLOGY,WGSS 320,Explores women's contributions and focuses in ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"gender,technolog"


NOTE: the above cell is likely neither the best nor the simplest way to execute this step! Feel free to take special liberties here. It's probably wise to manually pick out a few titles that you know should be returned, then check whether the script returns them.
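
As one possible alternative, the sketch below accumulates the matches for every technical keyword instead of keeping only those from the final pass of the loop. It works on plain dict records so that it runs without pandas; `filter_technical` is a hypothetical helper, and "COMPUTER ETHICS" and "SOCIAL PSYCHOLOGY" are made-up titles. In the data frame version, the analogous move would be to collect each per-keyword slice in a list and call `pd.concat` on that list once.

```python
import re

def filter_technical(records, technical):
    """Keep every record whose title matches any technical keyword pattern."""
    kept = []
    for word in technical:
        for rec in records:
            if re.search(word, rec['title'], flags=re.IGNORECASE):
                # copy the record and join the two keywords for auditing
                kept.append({**rec, 'keyword': rec['keyword'] + ',' + word})
    return kept

sample_records = [
    {'title': 'COMPUTER ETHICS', 'keyword': 'ethic'},          # hypothetical
    {'title': '*GENDER AND TECHNOLOGY', 'keyword': 'gender'},
    {'title': 'SOCIAL PSYCHOLOGY', 'keyword': 'social'},       # hypothetical
]
matched = filter_technical(sample_records, ['comput', 'technolog'])

# spot check: titles we know should come back
expected_titles = {'COMPUTER ETHICS', '*GENDER AND TECHNOLOGY'}
found_titles = {rec['title'] for rec in matched}
print(matched)
print(expected_titles <= found_titles)
```

A title that matches several technical keywords appears once per keyword here, so a deduplication step may be wanted depending on how the final dataset is used.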

In [8]:
#combine dfs 
oregon = pd.concat([df, exceptions])
oregon

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
39,"*EVOLUTION OF PEOPLE, TECHNOLOGY, AND SOCIETY",ANTH 330,Overview of the evolution and prehistory of th...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog"
88,"*PHOTOGRAPHY: HISTORY, TECHNOLOGY, CULTURE A...",ART 264,Introduction to the history of photography thr...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"histor,technolog"
123,*ENERGY TECHNOLOGY AND SOCIAL CHANGE,BRR 325,Science and technology co-evolve with a prospe...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
156,BUSINESS LAW - TECHNOLOGY/NEW VENTURES,BA 531,An integrative course on managing legal and et...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"law,technolog"
260,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",ENGR 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
598,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",MATS 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
624,*SOCIETAL ASPECTS OF NUCLEAR TECHNOLOGY,NSE 319,Description and discussion of nuclear-related ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog"
1011,*SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 456,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
1029,SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 556,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog"
1068,*GENDER AND TECHNOLOGY,WGSS 320,Explores women's contributions and focuses in ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"gender,technolog"


Lastly, we want to export our csv. Ideally, all csv files should be written to the courses directory in our repository. 

In [10]:
#export as csv
oregon.to_csv('6-Oregon State University.csv')
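
To make sure the file lands in the courses directory, the output path can be built explicitly. The following is a stdlib sketch, assuming the directory is named `courses` and sits at the repository root; `export_courses` is a hypothetical helper, and a temporary directory stands in for the repository so the example leaves no files behind. With pandas, the rough equivalent would be passing the same path to `to_csv`.

```python
import csv
import tempfile
from pathlib import Path

def export_courses(records, columns, out_dir, filename):
    """Write the records to out_dir/filename, creating the directory if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open('w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)
    return out_path

sample_columns = ('title', 'dept_num', 'keyword')
sample_rows = [{'title': 'NETWORK SECURITY', 'dept_num': 'CS 478', 'keyword': 'secur'}]

# a temporary directory stands in for the repository root
with tempfile.TemporaryDirectory() as repo_root:
    path = export_courses(sample_rows, sample_columns,
                          Path(repo_root) / 'courses',
                          '6-Oregon State University.csv')
    relative = path.relative_to(repo_root).as_posix()
    with path.open(newline='', encoding='utf-8') as handle:
        rows = list(csv.DictReader(handle))

print(relative)
print(rows)
```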