## Oregon State University Crawler

Imports.

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.request #handles urls
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

Keyword preprocessing, the URLs for the two relevant catalog years (2019-20 and 2018-19), and a list of department codes.

In [2]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    keyword = word_tokenize(keyword) #tokenize into a list of words
    for word in keyword:
        keyword = stemmer.stem(word) #stem; each pass overwrites the result, so only the last word's stem is kept
    return keyword

#course catalog URLs - 2 academic years 
urls = ['https://catalog.oregonstate.edu/courses/',
        'https://catalog.oregonstate.edu/archives/2018-2019/courses/']

#list of all the departments to search through 
departments = ['als/','actg/','ahe/','aae/','as/','aed/',
              'agri/','ag/','asl/','ams/','ans/','anth/',
              'aec/','aj/','arab/','art/','asn/','ats/',
              'bb/','bhs/','bee/','bds/','bioe/','bi/',
              'brr/','bot/','ba/','cbee/','che/','ch/',
              'chn/','cce/','ce/','cssa/','comm/','cs/',
              'cem/','coun/','css/','crop/','dsgn/','dhe/',
              'econ/','ed/','ece/','ese/','engr/','eng/',
              'ent/','eah/','enve/','ensc/','es/','film/',
              'fin/','fw/','fcsj/','fst/','fes/','fe/',
              'for/','fr/','gs/','geog/','gph/','geo/',
              'ger/','grad/','gd/','hhs/','herb/','hst/',
              'hsts/','hc/','hort/','hm/','hdfs/','hest/',
              'ie/','ib/','iepa/','iepg/','ieph/','ist/',
              'intl/','it/','jpn/','kin/','kor/','lead/',
              'la/','ls/','lib/','ling/','mgmt/','mfge/',
              'mrm/','mrkt/','mnr/','mpp/','mats/','mth/',
              'mime/','me/','mb/','ms/','mcb/','mus/','mup/',
              'mued/','nr/','ns/','nmc/','nse/','nur/',
              'nutr/','oeas/','oc/','op/','pax/','phar/',
              'phl/','pac/','pt/','ph/','pbg/','ps/','psm/',
              'psy/','h/','ppol/','qs/','rng/','rel/','rob/',
              'rs/','rus/','sed/','ssci/','soc/','se/','soil/',
              'span/','st/','sus/','snr/','ta/','tral/','tox/',
              'tcs/','uexp/','vmb/','vmc/','wre/','wrp/',
              'wrs/','wgss/','wse/','wlc/','wr/','z/']

Create the normative and technical keyword lists, the same way as in the example crawler.

In [3]:
#import keywords
keywords = pd.read_csv("keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#shorten selected stems so they match as substrings of titles (e.g. 'privac' is in 'privacy', 'privaci' is not)
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']
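
The `^` prefix on the short acronyms appears intended to act as a regular-expression anchor, since the technical stems are later passed to `str.contains`, which treats its pattern as a regex by default. The sketch below illustrates the effect with the standard `re` module; the course titles and the `matches` helper are hypothetical and are not taken from the catalog.

```python
import re

# Hypothetical course titles, used only to illustrate the matching behaviour.
titles = ["AI AND SOCIETY", "MAINTAINING FAIR SYSTEMS", "AIR POLLUTION LAW"]

def matches(pattern, title):
    """Return True if the regex pattern is found anywhere in the title."""
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

# Plain 'ai' also matches inside longer words such as 'MAINTAINING' and 'FAIR'.
plain = [t for t in titles if matches("ai", t)]

# '^ai' only matches titles that begin with 'ai', so the mid-word hits vanish.
# It still matches 'AIR ...', so some manual review remains necessary.
anchored = [t for t in titles if matches("^ai", t)]

print(plain)
print(anchored)
```

Note that the anchor ties the match to the start of the whole title, not to the start of a word, so an acronym appearing later in a title would be missed.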


Extraction process for Oregon State:
1. Loop through each year's catalog.
2. Loop through each department's page by concatenating the department code to the catalog URL. For Oregon State, all of a department's courses are listed on its page.
3. On the department page, make a list of all the courses by selecting the elements with the CSS class `courseblock`. This creates a list where each element is itself a list containing the full course title and description.
4. Loop through all the keywords in the normative list and check whether each keyword can be found in the full course title.
5. If the keyword is in the title, fill in every data column that can be located for that course.
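
Steps 4 and 5 can be sketched without any network access. This is a minimal illustration, assuming full titles of the form "DEPT NUM. TITLE. (N Credits)"; the sample titles, the `normative_sample` subset, and the `keyword_hits` helper are hypothetical.

```python
# A short, hypothetical subset of the normative stems.
normative_sample = ["ethic", "social", "secur"]

# Hypothetical full titles in the assumed style "DEPT NUM. TITLE. (N Credits)".
full_titles = [
    "CS 175. COMMUNICATIONS SECURITY AND SOCIAL MOVEMENTS. (3 Credits)",
    "MTH 251. DIFFERENTIAL CALCULUS. (4 Credits)",
]

def keyword_hits(full_title, stems):
    """Return every stem that occurs as a substring of the lowercased title."""
    lowered = full_title.lower()
    return [stem for stem in stems if stem in lowered]

# One (title, keyword) pair is recorded per hit, so a title can appear twice.
matches = [(t, k) for t in full_titles for k in keyword_hits(t, normative_sample)]

for title, kw in matches:
    print(kw, "->", title)
```

Because the check is a plain substring test, a title that contains two stems yields two rows, one per triggering keyword.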

The data columns are defined as below and have the same anatomy for every course:
* The course title - in between the first and last occurrences of '.' in the full title: `title`
* The department and course number - before the first occurrence of '.' in the full title: `dept_num`
* The course description - the third list element for the course: `description`
* The number of credits for the course - after the first occurrence of '(' in the full title: `credits`
* The course instructor - school does not list in catalog: `instructor`
* The link to the course syllabus (if applicable) - school does not list in catalog: `syllabus`
* The university the course is extracted from - all from the same university: `university`
* The term that the course is offered during (fall, spring, summer / year) - school only differentiates by year, not term: `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`
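
The slicing rules for `dept_num`, `title`, and `credits` can be sketched as a small helper. This is an illustration under assumptions: `parse_full_title` is a hypothetical name, the sample string only imitates the catalog's layout, and the `strip()` calls are an addition to tidy surrounding whitespace.

```python
def parse_full_title(full_title):
    """Split a full title into dept_num, title and credits by the positions of '.' and '('."""
    first_dot = full_title.find('.')
    last_dot = full_title.rfind('.')
    open_paren = full_title.find('(')
    return {
        'dept_num': full_title[:first_dot].strip(),
        'title': full_title[first_dot + 1:last_dot].strip(),
        # drop the opening '(' and the final character, assumed to be ')'
        'credits': full_title[open_paren + 1:len(full_title) - 1].strip(),
    }

# Hypothetical full title; the real catalog text may differ in spacing.
parsed = parse_full_title("BA 480. INFORMATION SYSTEMS SECURITY. (4 Credits)")
print(parsed)
```

A title that itself contains a '.' or '(' would shift these positions, so the rules rely on the catalog formatting every course the same way.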

In [4]:
#init dfs
oregon = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword'])
titles = []
dept_nums = []
descs = []
credit = []
profs = []
syllabi = []
uni = []
term = []     
keyword = []
URL = []

The extraction process. The table is created the same way as in the example crawler, except that it is built in a loop of its own after all the titles, credits, etc. have been gathered.

In [6]:
#looping through each year's catalog
for url in urls:
    #looping through all the departments' pages to process each individual course's information
    for dept in departments:
        page_link = url + dept
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, 'html.parser')
        courses = [p.get_text().split('\n') for p in soup.select(".courseblock")]
        for crs in courses:
            title = crs[1]
            for word in normative:
                if word in title.lower():
                    URL.append(page_link)
                    titles.append(title[title.find('.')+1:title.rfind('.')])
                    dept_nums.append(title[:title.find('.')])
                    descs.append(crs[2])
                    credit.append(title[title.find('(')+1:len(title)-1])
                    profs.append('Not Listed')
                    syllabi.append('Not Listed')
                    uni.append('Oregon State University')
                    if url == 'https://catalog.oregonstate.edu/courses/':
                        term.append('2019-20')
                    else:
                        term.append('2018-19')
                    keyword.append(word)
            
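#collect one row per match, then build the table in a single step
#(DataFrame.append was removed in pandas 2.0)
rows = []
for a,b,c,d,e,f,g,h,i,j in zip(titles,dept_nums,descs,credit,profs,syllabi,uni,term,keyword,URL):
    rows.append({'title': a, 
                 'dept_num': b,
                 'description': c,
                 'credits': d,
                 'instructor': e,
                 'syllabus': f,
                 'university': g,
                 'term': h,
                 'keyword': i,
                 'URL': j})
oregon = pd.DataFrame(rows, columns=['title','dept_num','description','credits','instructor',
                                     'syllabus','university','term','keyword','URL'])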


Post-filtering of courses. The code is identical to that of the example crawler.

In [7]:
exceptions = oregon.loc[(oregon['keyword']=='privac') | (oregon['keyword'] =='secur')]
exceptions

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
32,NATIONAL SECURITY AFFAIRS,AS 411,"Emphasis on the needs for national security, e...",3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/as/
150,INFORMATION SYSTEMS SECURITY,BA 480,Course emphasis is on security risk mitigation...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/ba/
152,INFORMATION SECURITY GOVERNANCE,BA 482,"As a discipline cybersecurity covers software,...",4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/ba/
186,*COMMUNICATIONS SECURITY AND SOCIAL MOVEMENTS,CS 175,Equipping students with the theory and practic...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/cs/
188,INTRODUCTION TO SECURITY,CS 370,Introductory course on computer security with ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/cs/
193,NETWORK SECURITY,CS 478,Basic concepts and techniques in network secur...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/cs/
194,CYBER-SECURITY,CS 578,A broad overview of the field of computer and ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/cs/
254,NETWORK SECURITY,ECE 478,Basic concepts and techniques in network secur...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/ece/
258,DATA SECURITY AND CRYPTOGRAPHY,ECE 575,"Secret-key and public-key cryptography, authen...",3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/ece/
259,CYBER-SECURITY,ECE 578,A broad overview of the field of computer and ...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,secur,https://catalog.oregonstate.edu/courses/ece/


In [8]:
#loop through technical keyword list, extract relevant titles
#note: df is reassigned on every pass, so only matches for the last keyword in the list are kept
for word in technical:
    df = oregon[oregon['title'].str.contains(word, flags = re.IGNORECASE)]
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
39,"*EVOLUTION OF PEOPLE, TECHNOLOGY, AND SOCIETY",ANTH 330,Overview of the evolution and prehistory of th...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog",https://catalog.oregonstate.edu/courses/anth/
88,"*PHOTOGRAPHY: HISTORY, TECHNOLOGY, CULTURE A...",ART 264,Introduction to the history of photography thr...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"histor,technolog",https://catalog.oregonstate.edu/courses/art/
123,*ENERGY TECHNOLOGY AND SOCIAL CHANGE,BRR 325,Science and technology co-evolve with a prospe...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/brr/
156,BUSINESS LAW - TECHNOLOGY/NEW VENTURES,BA 531,An integrative course on managing legal and et...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"law,technolog",https://catalog.oregonstate.edu/courses/ba/
260,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",ENGR 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/engr/
598,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",MATS 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/mats/
624,*SOCIETAL ASPECTS OF NUCLEAR TECHNOLOGY,NSE 319,Description and discussion of nuclear-related ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog",https://catalog.oregonstate.edu/courses/nse/
1012,*SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 456,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/soc/
1030,SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 556,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/soc/
1069,*GENDER AND TECHNOLOGY,WGSS 320,Explores women's contributions and focuses in ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"gender,technolog",https://catalog.oregonstate.edu/courses/wgss/
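
Every row in the output above carries the stem `technolog`, the last entry in the technical list, because `df` is reassigned on each pass of the loop. If matches for every technical stem are wanted, one option is to accumulate the hits instead. The sketch below shows the idea with plain dictionaries rather than a DataFrame so that it stays self-contained; the rows, the `technical_sample` subset, and the `combined` list are hypothetical.

```python
import re

# Hypothetical rows standing in for the normative matches held in `oregon`.
rows = [
    {"title": "*GENDER AND TECHNOLOGY", "keyword": "gender"},
    {"title": "DATA ETHICS", "keyword": "ethic"},
    {"title": "HISTORY OF ART", "keyword": "histor"},
]
technical_sample = ["data", "technolog"]  # subset of the technical stems

combined = []
for word in technical_sample:
    for row in rows:
        if re.search(word, row["title"], flags=re.IGNORECASE):
            # copy the row so the original list of rows is left untouched
            hit = dict(row)
            hit["keyword"] = row["keyword"] + "," + word
            combined.append(hit)

for hit in combined:
    print(hit["title"], "|", hit["keyword"])
```

With a DataFrame, the same effect could be had by collecting each per-keyword selection in a list and joining them with `pd.concat` after the loop.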


In [9]:
#combine dfs 
oregon = pd.concat([df, exceptions])
oregon

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
39,"*EVOLUTION OF PEOPLE, TECHNOLOGY, AND SOCIETY",ANTH 330,Overview of the evolution and prehistory of th...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog",https://catalog.oregonstate.edu/courses/anth/
88,"*PHOTOGRAPHY: HISTORY, TECHNOLOGY, CULTURE A...",ART 264,Introduction to the history of photography thr...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"histor,technolog",https://catalog.oregonstate.edu/courses/art/
123,*ENERGY TECHNOLOGY AND SOCIAL CHANGE,BRR 325,Science and technology co-evolve with a prospe...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/brr/
156,BUSINESS LAW - TECHNOLOGY/NEW VENTURES,BA 531,An integrative course on managing legal and et...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"law,technolog",https://catalog.oregonstate.edu/courses/ba/
260,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",ENGR 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/engr/
598,"THE SCIENCE, ENGINEERING AND SOCIAL IMPACT O...",MATS 221,Nanotechnology is an emerging engineering fiel...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/mats/
624,*SOCIETAL ASPECTS OF NUCLEAR TECHNOLOGY,NSE 319,Description and discussion of nuclear-related ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"societ,technolog",https://catalog.oregonstate.edu/courses/nse/
1012,*SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 456,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/soc/
1030,SCIENCE AND TECHNOLOGY IN SOCIAL CONTEXT,SOC 556,Study of social aspects of science and technol...,4 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"social,technolog",https://catalog.oregonstate.edu/courses/soc/
1069,*GENDER AND TECHNOLOGY,WGSS 320,Explores women's contributions and focuses in ...,3 Credits,Not Listed,Not Listed,Oregon State University,2019-20,"gender,technolog",https://catalog.oregonstate.edu/courses/wgss/


Export the filtered courses to CSV.

In [10]:
#export as csv
oregon.to_csv('7-Oregon State University.csv')