## Idaho State University Crawler

Imports.

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.request #handles urls
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

Keyword preprocessing, the URLs of the two relevant catalog years (2019-20 and 2018-19), and the list of department codes.

In [2]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    tokens = word_tokenize(keyword) #tokenize
    stem = ''
    for word in tokens:
        stem = stemmer.stem(word) #stem; only the last token's stem is kept
    return stem

#course catalog URLs - 2 academic years 
urls = ['http://coursecat.isu.edu/undergraduate/allcourses/',
        'http://coursecat.isu.edu/previouscatalogs/2018-19/undergraduate/allcourses/']

#list of all the departments to search through 
departments = ['acad/','acct/','admt/','airm/','amst/',
               'anth/','arbc/','art/','acrr/','autm/',
               'biol/','psci/','bed/','ba/','bt/','cte/',
               'chem/','cfs/','chld/','chns/','ce/','cet/',
               'comm/','cmp/','csd/','cpar/','cmlt/','cadd/',
               'mach/','cs/','cosm/','coun/','daac/','danc/',
               'dent/','dms/','desl/','ntd/','dhs/','econ/',
               'educ/','elap/','ee/','eet/','ems/','emgt/',
               'eset/','engl/','enve/','fcs/','fin/','fsa/',
               'fren/','gemt/','geol/','germ/','glbl/','hca/',
               'he/','hit/','ho/','hphy/','hist/','hrd/','idep/',
               'info/','its/','inst/','is/','japn/','lang/',
               'latn/','lawe/','llib/','mgt/','mktg/','msth/',
               'math/','me/','ma/','mls/','msl/','muse/','musa/',
               'musi/','musc/','musp/','nse/','ne/','nurs/','ota/',
               'opt/','olp/','para/','parm/','phar/','ppra/','phtc/',
               'phil/','peac/','pe/','ptot/','pta/','pas/','phys/',
               'plap/','pols/','pnur/','pte/','psyc/','rs/','adrn/',
               'resp/','rcet/','russ/','scpy/','shos/','sowk/','soc/',
               'span/','sped/','stua/','tge/','thea/','hons/','us/',
               'uas/','weld/']

Creation of the normative and technical keyword lists, the same as in the example crawler.

In [3]:
#import keywords
keywords = pd.read_csv("keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']
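
The abbreviation anchoring above relies on `str.replace`, which rewrites every occurrence of the substring, so a stem that merely contains `ai` or `ml` would also be altered. No stem in the current list is affected, so the printed output stays the same. A minimal sketch of an exact-match alternative follows; the `ABBREVIATIONS` set and the `anchor_abbreviations` helper are hypothetical names introduced for illustration, and `maintain` is an invented stem.

```python
# Hypothetical helper: anchor only stems that ARE an abbreviation,
# rather than every stem that CONTAINS one.
ABBREVIATIONS = {"ai", "cs", "ict", "ml", "nlp"}

def anchor_abbreviations(stems):
    """Prefix '^' to exact abbreviation matches and leave other stems untouched."""
    return ["^" + s if s in ABBREVIATIONS else s for s in stems]

sample = ["ai", "algorithm", "maintain", "ml"]  # 'maintain' is an invented example stem
anchored = anchor_abbreviations(sample)

# The substring-based approach, shown for comparison
substring_version = [w.replace("ai", "^ai") for w in sample]

print(anchored)
print(substring_version)
```

The second printed list shows how the substring approach would mangle a stem that happens to contain `ai`.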


Extraction process for Idaho State University:
1. Loop through each year's catalog.
2. Loop through each department's page by concatenating the department code to the catalog url.
3. On the department page, make a list of all the course titles, credits, and descriptions.
4. Loop through all the keywords in the normative list and check whether each keyword appears in the course title.
5. If the keyword is in the title, fill in every data column that can be located for that course.
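
Steps 4 and 5 can be sketched without any network access. The course titles below are invented sample data, and `match_titles` is a hypothetical helper rather than part of the crawler. It only illustrates that a course yields one row per matching keyword.

```python
def match_titles(titles, stems):
    """Return (title, stem) pairs for every stem found in a lowercased title."""
    matches = []
    for title in titles:
        lowered = title.lower()
        for stem in stems:
            if stem in lowered:  # plain substring test, as in the crawler loop
                matches.append((title, stem))
    return matches

sample_titles = ["Ethics and the Law", "Organic Chemistry", "Network Security"]
sample_stems = ["ethic", "law", "secur"]

matches = match_titles(sample_titles, sample_stems)
for title, stem in matches:
    print(f"{stem:>6}  {title}")
```

Because the test is a substring match, a short stem such as `law` would also match an unrelated word like "flaws", which is one reason the later filtering step exists.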

Data columns are defined as follows and have the same anatomy for each course:
* The course title - in the title line, the text between the first \xa0 separator and the last colon: `title`
* The department and course number - in the title line, the text before the first \xa0 separator: `dept_num`
* The course description - the description line of the course block, up to its last period: `description`
* The number of credits for the course - in the title line, the text after the last \xa0 separator: `credits`
* The course instructor - school does not list in catalog: `instructor`
* The link to the course syllabus (if applicable) - school does not list in catalog: `syllabus`
* The university the course is extracted from - all from the same university: `university`
* The term that the course is offered during (fall, spring, summer / year) - the catalog year is taken from the url, and any short offering codes after the description's last period (e.g. F, S, D) are appended: `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`
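
The title-line slicing can be illustrated on a single string. This is a sketch under assumptions: the sample line is invented, the separator is taken to be a single `\xa0` character, a colon is assumed to precede the credits, and `split_title_line` is a hypothetical helper. The real catalog markup may differ.

```python
def split_title_line(line, sep="\xa0"):
    """Split a course title line into (dept_num, title, credits)."""
    first = line.find(sep)   # separator after the department and number
    last = line.rfind(sep)   # separator before the credits
    dept_num = line[:first].strip()
    title = line[first:line.rfind(":")].strip()
    credits = line[last:].strip().rstrip(".")
    return dept_num, title, credits

# Invented sample line; the format is an assumption, not copied from the catalog
sample_line = "CS 4420\xa0Computer Security and Cryptography:\xa03 semester hours."
dept_num, title, credits = split_title_line(sample_line)
print(dept_num)
print(title)
print(credits)
```

`str.strip()` removes non-breaking spaces as well as ordinary whitespace, so the leading separator left by the slices does not end up in the stored values.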

In [4]:
#init dfs
idaho = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword'])
titles = []
dept_nums = []
descs = []
credit = []
profs = []
syllabi = []
uni = []
term = []     
keyword = []

The extraction process. The table is built the same way as in the example crawler, but as a separate step that runs after all the titles, credits, etc. have been gathered.

In [5]:
#looping through each years catalog
for url in urls:
    #looping through all the departments pages to process individual course's information
    for dept in departments:
        page_link = url + dept
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, 'html.parser')
        courses = [p.get_text().split('\n') for p in soup.select(".courseblock")]
        for crs in courses:
            title = crs[1]
            for word in normative:
                if word in title.lower():
                    titles.append(title[title.find('  '):title.rfind(':')])
                    dept_nums.append(title[:title.find('  ')])
                    if(crs[3]==''): descs.append('No description available.')
                    else: descs.append(crs[3][:crs[3].rfind('.')])
                    credit.append(title[title.rfind('  '):len(title)-1])
                    profs.append('Not Listed')
                    syllabi.append('Not Listed')
                    uni.append('Idaho State University')
                    if(url=='http://coursecat.isu.edu/undergraduate/allcourses/'): 
                        if(len(crs[3][crs[3].rfind('.')+1:])>7): term.append('2019-20')
                        else: term.append('2019-20' + crs[3][crs[3].rfind('.')+1:])
                    elif(len(crs[3][crs[3].rfind('.')+1:])>7): term.append('2018-19')
                    else: term.append('2018-19' + crs[3][crs[3].rfind('.')+1:])
                    keyword.append(word)
            
#build the table in one step; DataFrame.append was removed in pandas 2.0
idaho = pd.DataFrame({'title': titles,
                      'dept_num': dept_nums,
                      'description': descs,
                      'credits': credit,
                      'instructor': profs,
                      'syllabus': syllabi,
                      'university': uni,
                      'term': term,
                      'keyword': keyword})
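
The term rule inside the crawl loop is easier to follow in isolation. The sketch below restates it as a hypothetical `term_label` helper; both descriptions are invented sample text.

```python
def term_label(year, description):
    """Append trailing offering codes to the catalog year when they are short."""
    trailing = description[description.rfind(".") + 1:]
    # Mirrors the crawler's rule: a trailing segment longer than 7 characters
    # is treated as ordinary text rather than offering codes.
    if len(trailing) > 7:
        return year
    return year + trailing

print(term_label("2019-20", "Overview of wireless networks. S, D"))
print(term_label("2018-19", "Introduces the field. Offered on demand only"))
```

One consequence of the 7-character cutoff is that a three-code suffix written as ` F, S, D` is 8 characters long and would be dropped, leaving only the year.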


Post-filtering of courses. The code is identical to that of the example crawler.

In [6]:
exceptions = idaho.loc[(idaho['keyword']=='privac') | (idaho['keyword'] =='secur')]
exceptions

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
58,Computer Security and Cryptography,CS 4420,"Public key and private key cryptography, key d...",3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur
59,Secure Software Engineering,CS 4424,Introduction to the Secure Software Developmen...,3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur
107,Wireless Network Security,ESET 0282,Overview of wireless networks with a focus on ...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 S, D",secur
108,Introduction to Network Security I,ESET 0282A,Facilitates competence in networking fundament...,1 semester hour,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur
109,Introduction to Network Security II,ESET 0282B,Continuation of ESET 0282A. Through a hands on...,2 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur
110,Information System Security Design,ESET 0283,Examination of the design methods and techniqu...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur
113,Critical Network Security,ESET 0286,Comprehensive review and analysis of current a...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 S, D",secur
114,Cyber Physical Systems Security Capstone,ESET 0289,Promotes professional development through part...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 F, S",secur
200,Systems Security for Senior Management,INFO 4412,"Review of system architecture, system security...",1-3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur
201,Systems Security Administration,INFO 4413,Outlines the basic principles of systems secur...,1-3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur


In [7]:
#loop through technical keyword list, extract relevant titles
#note: df is reassigned on every pass, so only matches for the last
#technical keyword ('technolog') remain after the loop
for word in technical:
    df = idaho[idaho['title'].str.contains(word, flags = re.IGNORECASE)].copy() #copy avoids SettingWithCopyWarning
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df


Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
245,Law Office Technology,PARA 0119,Students will learn advanced and specialized c...,2 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 F,"law,technolog"
373,Applied Ethics in Technology,TGE 1257,An introduction to the study of ethics and con...,3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,"ethic,technolog"
616,Law Office Technology,PARA 0119,Students will learn advanced and specialized c...,2 semester hours,Not Listed,Not Listed,Idaho State University,2018-19 F,"law,technolog"
742,Applied Ethics in Technology,TGE 1257,An introduction to the study of ethics and con...,3 semester hours,Not Listed,Not Listed,Idaho State University,2018-19 D,"ethic,technolog"
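
The loop in cell 7 rebinds `df` on each pass, so the result shown reflects only the final technical keyword. If the intent is to keep titles that match any technical keyword, one possible approach is sketched below in plain Python with `re`. The rows and the shortened pattern list are sample data for illustration (one title is invented), and `technical_matches` is a hypothetical helper.

```python
import re

def technical_matches(title, patterns):
    """Return every pattern that matches the title, case-insensitively."""
    return [p for p in patterns if re.search(p, title, flags=re.IGNORECASE)]

rows = [
    {"title": "Applied Ethics in Technology", "keyword": "ethic"},
    {"title": "Information System Security Design", "keyword": "secur"},
    {"title": "History of Maintained Gardens", "keyword": "histor"},  # invented title
]
patterns = ["^ai", "inform", "system", "technolog"]

kept = []
for row in rows:
    hits = technical_matches(row["title"], patterns)
    if hits:  # keep the row only if at least one technical pattern matched
        kept.append({**row, "keyword": ",".join([row["keyword"]] + hits)})

for row in kept:
    print(row["title"], "->", row["keyword"])
```

Applying this to the full table would change the results shown in this notebook, since courses matching earlier keywords such as `inform` or `system` would also be retained.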


In [8]:
#combine dfs 
idaho = pd.concat([df, exceptions])
idaho

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword
245,Law Office Technology,PARA 0119,Students will learn advanced and specialized c...,2 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 F,"law,technolog"
373,Applied Ethics in Technology,TGE 1257,An introduction to the study of ethics and con...,3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,"ethic,technolog"
616,Law Office Technology,PARA 0119,Students will learn advanced and specialized c...,2 semester hours,Not Listed,Not Listed,Idaho State University,2018-19 F,"law,technolog"
742,Applied Ethics in Technology,TGE 1257,An introduction to the study of ethics and con...,3 semester hours,Not Listed,Not Listed,Idaho State University,2018-19 D,"ethic,technolog"
58,Computer Security and Cryptography,CS 4420,"Public key and private key cryptography, key d...",3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur
59,Secure Software Engineering,CS 4424,Introduction to the Secure Software Developmen...,3 semester hours,Not Listed,Not Listed,Idaho State University,2019-20 D,secur
107,Wireless Network Security,ESET 0282,Overview of wireless networks with a focus on ...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 S, D",secur
108,Introduction to Network Security I,ESET 0282A,Facilitates competence in networking fundament...,1 semester hour,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur
109,Introduction to Network Security II,ESET 0282B,Continuation of ESET 0282A. Through a hands on...,2 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur
110,Information System Security Design,ESET 0283,Examination of the design methods and techniqu...,3 semester hours,Not Listed,Not Listed,Idaho State University,"2019-20 F, D",secur


Export the results to CSV.

In [9]:
#export as csv
idaho.to_csv('43-Idaho State University.csv')