## University of Mississippi Crawler

Imports.

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.request #handles urls
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

Keyword preprocessing and the URL list of the relevant catalog years and terms. Also a list of letters (A through W) for the alphabetical course index.

In [2]:
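#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    tokens = word_tokenize(keyword) #tokenize
    #stem each token; only the last stem is kept, so a multi-word
    #keyword is reduced to the stem of its final word
    for word in tokens:
        keyword = stemmer.stem(word) #stem
    return keyword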

#course catalog URLs - 2 academic years, fall and spring terms 
urls = ['https://catalog.olemiss.edu/2020/spring/courses/', #2019-20 Spring
        'https://catalog.olemiss.edu/courses/', #2019-20 Fall
        'https://catalog.olemiss.edu/2019/spring/courses/', #2018-19 Spring
        'https://catalog.olemiss.edu/2019/fall/courses/'] #2018-19 Fall

#list of uppercase letters for the alphabetical index (A through W)
alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W']
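
Each index page is addressed by appending a letter to a catalog URL, which is what the extraction loop later does with `url + alpha`. A minimal sketch of that construction, using `string.ascii_uppercase` instead of a hand-typed list; the helper name `index_pages` is illustrative and not part of the notebook:

```python
import string

# Letters A through W, matching the hand-typed list in the notebook
alphabet = list(string.ascii_uppercase[:23])

def index_pages(base_url, letters):
    """Return one index-page URL per letter (hypothetical helper)."""
    return [base_url + letter for letter in letters]

pages = index_pages('https://catalog.olemiss.edu/2020/spring/courses/', alphabet)
print(pages[0])    # first index page
print(len(pages))  # number of index pages for this catalog
```

Slicing the standard alphabet avoids typos in the letter list and makes the cut-off at W explicit.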

Creation of the normative and technical keyword lists, the same as in the example crawler.

In [3]:
#import keywords
keywords = pd.read_csv("keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']
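
The replacements above appear to serve two purposes, although the notebook does not state them: trimming a Porter stem such as 'privaci' to 'privac' lets a plain substring test match 'privacy', and prefixing a short acronym with `^` turns it into a regular expression that matches only at the start of a title. A small sketch of both effects, using a made-up course title:

```python
import re

# Hypothetical title, lowercased as in the extraction loop
title = 'Data Privacy and Training Methods'.lower()

# The raw stem is not a substring of 'privacy'; the trimmed one is
raw_stem_hit = 'privaci' in title
trimmed_hit = 'privac' in title

# A bare 'ai' also matches inside 'training'; the anchored pattern does not
bare_ai_hit = 'ai' in title
anchored_ai_hit = bool(re.search('^ai', title))
anchored_ai_at_start = bool(re.search('^ai', 'ai and society'))

print(raw_stem_hit, trimmed_hit)
print(bare_ai_hit, anchored_ai_hit, anchored_ai_at_start)
```

The anchored patterns only make sense where the keyword is used as a regular expression, as in the later `str.contains` call on the technical list.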


Extraction process for the University of Mississippi:
1. Loop through the urls for the years and terms.
2. Loop through all the pages of the A-W course index to get all the courses and departments.
3. Search through the title of each course on the index page to find which courses match the keywords.
4. Gather information for course title and number.
5. Open the links for the courses that match the keywords. (There are many links to open, so this step makes the program run for a long time.)
6. Gather information for credits and course description.

The data columns are defined as follows and have the same anatomy for each course:
* The course title - after the ':' in the title on the index page: `title`
* The department and course number - before the ':' in the title on the index page: `dept_num`
* The course description - the first list element for the course information: `description`
* The number of credits for the course - the second list element for the course information: `credits`
* The course instructor - school does not list in catalog: `instructor`
* The link to the course syllabus (if applicable) - school does not list in catalog: `syllabus`
* The university the course is extracted from - all from the same university: `university`
* The term that the course is offered during (fall, spring, summer / year) - found by association with the corresponding url in urls: `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`
* The link to the course page that the description and credits are read from: `URL`
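
Two of the columns above are derived rather than scraped directly: `dept_num` and `title` come from splitting the index entry at the colon, and `term` comes from the catalog URL. A minimal sketch of both, assuming index entries look like 'Phil 340: Philosophy of Technology'; the helper `split_index_title` and the `TERMS` dictionary are illustrative names, and the term labels copy those used in the extraction loop:

```python
# Term labels keyed by catalog URL, copied from the extraction loop's if/elif chain
TERMS = {
    'https://catalog.olemiss.edu/2020/spring/courses/': 'Spring 2020',
    'https://catalog.olemiss.edu/courses/': 'Fall 2020',
    'https://catalog.olemiss.edu/2019/spring/courses/': 'Spring 2019',
    'https://catalog.olemiss.edu/2019/fall/courses/': 'Fall 2019',
}

def split_index_title(index_title):
    """Split 'Dept Num: Title' at the first colon (hypothetical helper)."""
    dept_num, _, title = index_title.partition(':')
    return dept_num.strip(), title.strip()

# Assumed entry format; the real index text may differ
dept_num, title = split_index_title('Phil 340: Philosophy of Technology')
term = TERMS['https://catalog.olemiss.edu/2019/spring/courses/']
print(dept_num, '|', title, '|', term)
```

A dictionary lookup keeps the URL-to-term association in one place, so adding a catalog year means adding one entry rather than another branch.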

In [4]:
#init dfs
mississippi = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
                                'syllabus','university','term','keyword','URL'])
titles = []
dept_nums = []
descs = []
credit = []
profs = []
syllabi = []
uni = []
term = []     
keyword = []
URL = []

The extraction process. The process to create the table is kept the same as in the example crawler, but runs as a separate step after all the titles, credits, etc. have been gathered.

In [8]:
#looping through each year's catalog
for url in urls:
    #looping through the A-Z index
    for alpha in alphabet:
        page_link = url + alpha
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, 'html.parser')
        #Creates lists for the course titles and links respectively
        course_titles = [p.get_text() for p in soup.find_all('a')]
        course_links = [p.get('href') for p in soup.find_all('a')]
        #The if/else block below makes sure the course lists above contain only course entries
        #and not the other elements tagged 'a' in the page, by slicing out the unnecessary elements
        if 'View this in current catalog' in course_titles:
            course_links = course_links[course_titles.index('W')+3:
                                          course_titles.index('View this in current catalog')]
            course_titles = course_titles[course_titles.index('W')+3:
                                          course_titles.index('View this in current catalog')]
        else:
            course_links = course_links[course_titles.index('W')+3:
                                          course_titles.index('View this in another catalog')]
            course_titles = course_titles[course_titles.index('W')+3:
                                          course_titles.index('View this in another catalog')]
        if 'W' in course_titles:
            course_links = course_links[course_titles.index('W')+3:]
            course_titles = course_titles[course_titles.index('W')+3:]
        for x in range(len(course_titles)):
            title = course_titles[x]
            for word in normative:
                if word in title.lower() and ':' in title:
                    #For each course whose title contains a keyword, open its URL (course_links stays
                    #index-aligned with course_titles) and gather the description and credits from that page
                    info_response = requests.get(course_links[x])
                    tea = BeautifulSoup(info_response.content, 'html.parser')
                    information = [p.get_text() for p in tea.find_all('p')]
                    URL.append(course_links[x])
                    titles.append(title[title.index(':')+1:])
                    dept_nums.append(title[:title.index(':')])
                    descs.append(information[0])
                    credit.append(information[1])
                    profs.append('Not Listed')
                    syllabi.append('Not Listed')
                    uni.append('University of Mississippi')
                    if url=='https://catalog.olemiss.edu/2020/spring/courses/': term.append('Spring 2020')
                    elif url=='https://catalog.olemiss.edu/courses/': term.append('Fall 2020')
                    elif url=='https://catalog.olemiss.edu/2019/spring/courses/': term.append('Spring 2019')
                    else: term.append('Fall 2019')
                    keyword.append(word)
            
#DataFrame.append was removed in pandas 2.0, so collect the rows first and build the table once
rows = []
for a,b,c,d,e,f,g,h,i,j in zip(titles,dept_nums,descs,credit,profs,syllabi,uni,term,keyword,URL):
    rows.append({'title': a,
                 'dept_num': b,
                 'description': c,
                 'credits': d,
                 'instructor': e,
                 'syllabus': f,
                 'university': g,
                 'term': h,
                 'keyword': i,
                 'URL': j})
mississippi = pd.DataFrame(rows, columns=mississippi.columns)
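
The slicing in the loop above keeps only the anchors that sit between the letter navigation (everything up to three positions past 'W') and the 'View this in ... catalog' footer link. A minimal sketch of that idea as a reusable function, run on made-up anchor lists; `slice_courses` is an illustrative name, and unlike the notebook it falls back to the end of the list when neither footer marker is present:

```python
def slice_courses(texts, links, start_marker='W', skip=3,
                  end_markers=('View this in current catalog',
                               'View this in another catalog')):
    """Keep only the entries between the letter navigation and the footer link."""
    start = texts.index(start_marker) + skip  # same offset as the notebook's +3
    end = len(texts)
    for marker in end_markers:
        if marker in texts:
            end = texts.index(marker)
            break
    # Slice both lists with the same bounds so they stay index-aligned
    return texts[start:end], links[start:end]

# Made-up page anchors: navigation, two filler links, two courses, footer
texts = ['Home', 'A', 'W', 'nav1', 'nav2',
         'Phil 340: Philosophy of Technology',
         'Soc 321: Science, Technology and Society',
         'View this in another catalog', 'Contact']
links = ['/l%d' % n for n in range(len(texts))]

course_titles, course_links = slice_courses(texts, links)
print(course_titles)
print(course_links)
```

Computing the bounds once and applying them to both lists avoids the ordering hazard of slicing `course_links` with indices taken from an already-sliced `course_titles`.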


Post-filtering of courses. The code is identical to that of the example crawler.

In [9]:
exceptions = mississippi.loc[(mississippi['keyword']=='privac') | (mississippi['keyword'] =='secur')]
exceptions

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
46,Securities Regulations,Accy 650,An examination of federal and state securities...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/patter...
99,Introduction to Homeland Security,CJ 115,The issues pertaining to the role and mission ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
112,Homeland Security Operations,CJ 400,An examination of government agencies that are...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
114,Homeland Security Law,CJ 420,Examination of current domestic legal issues r...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
119,Border Security,CJ 470,This course provides the student with an analy...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
130,Seminar in Homeland Security,CJ 630,"Examines security theories, research, and prac...",3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
134,Critical Infrastructure Security,CJ 636,Review of U.S. counterterrorism policies and p...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
135,Cybercrime and Cyber Security,CJ 642,Overview of current issues surrounding the tec...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
143,Intelligence and Homeland Security,CJ 670,Advanced course on intelligence and counterint...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/applie...
159,Fundamentals of Computer Security,Csci 427,This course explores the concepts and methods ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,secur,https://catalog.olemiss.edu/2020/spring/engine...


In [10]:
#loop through technical keyword list, extract relevant titles
#note: df is overwritten on every pass, so after the loop it holds only the
#matches for the last keyword in technical ('technolog')
for word in technical:
    #.copy() gives an independent frame, so adding a column raises no SettingWithCopyWarning
    df = mississippi[mississippi['title'].str.contains(word, flags = re.IGNORECASE)].copy()
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df


Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
174,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/educat...
569,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/educat...
820,"Mass Comm, Technology, and Society",Jour 573,The theory of mass communications technology i...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/journa...
1186,Philosophy of Technology,Phil 340,This course will examine philosophical issues ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"philosoph,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1365,"Science, Technology, & Public Policy",PPL 386,Examination of factors which shape public poli...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"polic,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1440,"Science, Technology and Society",Soc 321,An examination of the nature of relationships ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1452,"Environment, Technology and Society",Soc 411,This course will explore the ways people relat...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1706,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"societ,technolog",https://catalog.olemiss.edu/education/teacher-...
1957,"Mass Comm, Technology, and Society",Jour 573,The theory of mass communications technology i...,3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"societ,technolog",https://catalog.olemiss.edu/journalism/jour-573
2323,Philosophy of Technology,Phil 340,This course will examine philosophical issues ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"philosoph,technolog",https://catalog.olemiss.edu/liberal-arts/philo...


In [11]:
#combine dfs 
mississippi = pd.concat([df, exceptions])
mississippi

Unnamed: 0,title,dept_num,description,credits,instructor,syllabus,university,term,keyword,URL
174,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/educat...
569,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/educat...
820,"Mass Comm, Technology, and Society",Jour 573,The theory of mass communications technology i...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/journa...
1186,Philosophy of Technology,Phil 340,This course will examine philosophical issues ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"philosoph,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1365,"Science, Technology, & Public Policy",PPL 386,Examination of factors which shape public poli...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"polic,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1440,"Science, Technology and Society",Soc 321,An examination of the nature of relationships ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1452,"Environment, Technology and Society",Soc 411,This course will explore the ways people relat...,3 Credits,Not Listed,Not Listed,University of Mississippi,Spring 2020,"societ,technolog",https://catalog.olemiss.edu/2020/spring/libera...
1706,Science Technology Society in Classroom,Edci 616,"The interrelationships among trends, issues, a...",3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"societ,technolog",https://catalog.olemiss.edu/education/teacher-...
1957,"Mass Comm, Technology, and Society",Jour 573,The theory of mass communications technology i...,3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"societ,technolog",https://catalog.olemiss.edu/journalism/jour-573
2323,Philosophy of Technology,Phil 340,This course will examine philosophical issues ...,3 Credits,Not Listed,Not Listed,University of Mississippi,Fall 2020,"philosoph,technolog",https://catalog.olemiss.edu/liberal-arts/philo...
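
The post-filtering rule can be restated without pandas: a course is kept if its title matches a technical pattern, or if its normative keyword is one of the two exceptions ('privac', 'secur'). The sketch below is a plain-Python illustration of that rule, not the notebook's code. It checks every technical pattern, whereas the notebook's loop keeps only the matches for the last one, and it uses a shortened keyword list and partly made-up rows:

```python
import re

exceptions = {'privac', 'secur'}
technical = ['^ai', 'data', 'technolog']  # shortened list for illustration

rows = [
    {'title': 'Philosophy of Technology', 'keyword': 'philosoph'},
    {'title': 'Border Security', 'keyword': 'secur'},
    {'title': 'Moral Philosophy', 'keyword': 'moral'},  # hypothetical row
]

def post_filter(rows, technical, exceptions):
    """Keep rows with a technical title match or an exception keyword."""
    kept = []
    for row in rows:
        hits = [t for t in technical
                if re.search(t, row['title'], re.IGNORECASE)]
        if hits:
            # Record the technical pattern(s) next to the normative keyword
            kept.append({**row, 'keyword': row['keyword'] + ',' + ','.join(hits)})
        elif row['keyword'] in exceptions:
            kept.append(dict(row))
    return kept

filtered = post_filter(rows, technical, exceptions)
for row in filtered:
    print(row['title'], '->', row['keyword'])
```

Because each row is visited once, a course that is both an exception and a technical match appears a single time instead of being duplicated by the concatenation.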


Export the results to CSV.

In [12]:
#export as csv
mississippi.to_csv('5-University of Mississippi.csv')