## Columbia University

This script serves as a basic tutorial for extracting courses of interest from a university. This is by no means the only (or even best way) to go about this process—so if you come up with a process that works better, feel free to implement! If you're unfamiliar with any of the libraries, the comments below annotate reasoning behind each.

In [21]:
import sys
import pandas as pd
import numpy as np
import time
import re
import urllib.request #handles urls
from urllib.request import urlopen
import urllib.parse 
import linkGrabber #extracts urls
import json #encodes/decodes json 
import csv 
import requests #downloads a webpage to scrape
from bs4 import BeautifulSoup, NavigableString, Tag #beautifulsoup pulls data from HTML
import nltk #NLP tasks
from nltk import word_tokenize
from nltk.stem import PorterStemmer #removes word endings
stemmer = PorterStemmer()

The first thing we want to do is set up a function for standard preprocessing. It's also useful to list all of the URLs we'll need to send requests to before scraping. We want all courses within a 2 year *academic* calendar (as opposed to an annual calendar). 

In [22]:
#keyword preprocessing
def preprocess(keyword):
    keyword = keyword.lower() #lowercase
    keyword = word_tokenize(keyword) #tokenize
    for word in keyword:
        keyword = stemmer.stem(word) #stem 
    return (keyword)

#course catalog URLs - 2 academic years 
#only 2019 is available, Fall(3), Summer(2), Spring(1)
# urls array

urls = [{"term": 'Fall 2019', "url":"?site=Directory_of_Classes&instr=&days=&semes=20191&hour="},
        {"term": 'Summer 2019', "url":'?si?site=Directory_of_Classes&instr=&days=&semes=20192&hour='},
        {"term": 'Spring 2019', "url":'?site=Directory_of_Classes&instr=&days=&semes=20193&hour='}]

link = 'https://doc.search.columbia.edu/classes/'

Next, we'll want to import our keyword csv, split our keyword lists, and preprocess them. The way the csv is set up, we'll want to split the words that are indicated as technical (`T`) or normative (`N`) and that we've chosen to include (`Y`). You'll notice that preprocessing is useful for some of our words but not for others. Here, we've chosen to manually alter words that are not usefully preprocessed. In this case, it means replacing instances of words that are stemmed to end in i.

[regex is a bitch here]

In [23]:
#import keywords
keywords = pd.read_csv("../keywords.csv")
technical = keywords[(keywords['Technical/Normative']=='T') & (keywords['Include']=='Y')].Keyword
normative = keywords[(keywords['Technical/Normative']=='N') & (keywords['Include']=='Y')].Keyword
normative = [preprocess(i) for i in normative]
technical = [preprocess(i) for i in technical] 

#replace keywords of interest
normative = [w.replace('privaci', 'privac') for w in normative]
normative = [w.replace('democraci', 'democra') for w in normative]
normative = [w.replace('equiti', 'equit') for w in normative]
normative = [w.replace('histori', 'histor') for w in normative]
normative = [w.replace('justice', 'justic') for w in normative]
normative = [w.replace('liberti', 'libert') for w in normative]
normative = [w.replace('philosophi', 'philosoph') for w in normative]
normative = [w.replace('societi', 'societ') for w in normative]
normative = [w.replace('polici', 'polic') for w in normative]

technical = [w.replace('ai', '^ai') for w in technical]
technical = [w.replace('cs', '^cs') for w in technical]
technical = [w.replace('ict', '^ict') for w in technical]
technical = [w.replace('ml', '^ml') for w in technical]
technical = [w.replace('nlp', '^nlp') for w in technical]

print(normative)
print(technical)

['account', 'critic', 'democra', 'discrimin', 'equal', 'equit', 'ethic', 'fair', 'femin', 'gender', 'govern', 'histor', 'inequ', 'justic', 'law', 'legal', 'libert', 'moral', 'norm', 'philosoph', 'polit', 'power', 'privac', 'race', 'religi', 'respons', 'right', 'secur', 'social', 'societ', 'surveil', 'transpar', 'valu', 'polic']
['^ai', 'algorithm', 'analyt', 'intellig', 'automat', 'code', 'comput', '^cs', 'cyber', 'data', 'digit', '^ict', 'inform', 'intelligen', 'internet', 'machin', '^ml', 'process', '^nlp', 'platform', 'program', 'robot', 'softwar', 'system', 'technolog']


The process behind extracting relevant courses works in two steps:
1. First, we want to find and extract all courses that contain any instance of a normative keyword.
2. Then, we want search within these courses to see if it also contains a technical keyword.

We initialize a data frame with columns for all of the course items we want to extract. It probably makes the most sense to standardize these feature names across all university scripts so that they're easier to merge in the final compiled dataset for all universities. Our items of interest are:
* The course title: `title`
* The department and course number: `dept_num`
* The course description: `description`
* The number of credits for the course: `credits`
* The course instructor: `instructor`
* The link to the course syllabus (if applicable): `syllabus`
* The university the course is extracted from: `university`
* The term that the course is offered during (fall, spring, summer / year): `term`
* The keyword that triggered the extraction (this is for auditing purposes): `keyword`

In [24]:
#init dfs
# columbia = pd.DataFrame(columns=['title','dept_num','description','credits','instructor',
#                                 'syllabus','university','term','keyword'])
# columbia = pd.DataFrame(columns=['title','university','term','keyword'])
columbia_list = []

The loop below executes part 1 of our extraction. It's long and kind of messy (sorry), so feel free to play around with the structure if you'd like. The key tasks here are to extract our items of interest based on our search queries and append them to our data frame.

In [25]:
#roster search for all urls
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

for url in urls:
    print("term", url["term"])
    
    #loop through all normative words and extract relevant elements 
    for word in normative: 
        print('-------------------')
        url_keyword = link + word + url['url'] #NOTE:this structure will likely be different between rosters!
        driver = webdriver.Chrome()
        driver.get(url_keyword)
        time.sleep(4)

        #the number of reponses
        elements = driver.find_elements_by_xpath('//*[@id="gsa-search-results"]/li')
        results = len(elements)
        print("elements", len(elements))
        
        #scraping each results
        for x in range(0, results):
            columbia_dict = {}
                   
            title = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/h3/a').text
            section = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/h3/a/span').text
            title = title.replace(section, '').strip()
            columbia_dict['title'] = title
            
            print(x, "/", len(elements))
#             print("SECTION", section)
            
#             syllabi = ''  
            columbia_dict['university'] = 'columbia university'
            columbia_dict['term'] = url["term"]  
            columbia_dict['keyword'] = word
            
#             columbia = columbia.append([title, dept_nums, descs, credit, profs, syllabi, uni, term, keyword])
            
            # only added if the title contains the word
            if word.upper() in title:
#                 course_link = driver.find_element_by_xpath('//*[@id="gsa-search-results"]/li[' + str(x+1) +']/div/div[2]').text
    
#                 driver.get(course_link)
#                 time.sleep(3)
            
#                 try:
#                     dept_nums = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[10]/td[2]').text
#                     columbia_dict['dept_nums'] = dept_nums
#                 except:
#                     pass


#                 try:
#                     credit = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[4]/td[2]').text
#                     columbia_dict['credit'] = credit
#                 except:
#                     pass

#                 try:
#                     profs = driver.find_element_by_xpath('//*[@id="col-right"]/table/tbody/tr[7]/td[2]').text
#                     columbia_dict['profs'] = profs
#                 except:
#                     pass
            
                columbia_list.append(columbia_dict)
            
        driver.close()


term Fall 2019
-------------------
elements 54
0 / 54
1 / 54
2 / 54
3 / 54
4 / 54
5 / 54
6 / 54
7 / 54
8 / 54
9 / 54
10 / 54
11 / 54
12 / 54
13 / 54
14 / 54
15 / 54
16 / 54
17 / 54
18 / 54
19 / 54
20 / 54
21 / 54
22 / 54
23 / 54
24 / 54
25 / 54
26 / 54
27 / 54
28 / 54
29 / 54
30 / 54
31 / 54
32 / 54
33 / 54
34 / 54
35 / 54
36 / 54
37 / 54
38 / 54
39 / 54
40 / 54
41 / 54
42 / 54
43 / 54
44 / 54
45 / 54
46 / 54
47 / 54
48 / 54
49 / 54
50 / 54
51 / 54
52 / 54
53 / 54
-------------------
elements 162
0 / 162
1 / 162
2 / 162
3 / 162
4 / 162
5 / 162
6 / 162
7 / 162
8 / 162
9 / 162
10 / 162
11 / 162
12 / 162
13 / 162
14 / 162
15 / 162
16 / 162
17 / 162
18 / 162
19 / 162
20 / 162
21 / 162
22 / 162
23 / 162
24 / 162
25 / 162
26 / 162
27 / 162
28 / 162
29 / 162
30 / 162
31 / 162
32 / 162
33 / 162
34 / 162
35 / 162
36 / 162
37 / 162
38 / 162
39 / 162
40 / 162
41 / 162
42 / 162
43 / 162
44 / 162
45 / 162
46 / 162
47 / 162
48 / 162
49 / 162
50 / 162
51 / 162
52 / 162
53 / 162
54 / 162
55 / 162
56 /

elements 302
0 / 302
1 / 302
2 / 302
3 / 302
4 / 302
5 / 302
6 / 302
7 / 302
8 / 302
9 / 302
10 / 302
11 / 302
12 / 302
13 / 302
14 / 302
15 / 302
16 / 302
17 / 302
18 / 302
19 / 302
20 / 302
21 / 302
22 / 302
23 / 302
24 / 302
25 / 302
26 / 302
27 / 302
28 / 302
29 / 302
30 / 302
31 / 302
32 / 302
33 / 302
34 / 302
35 / 302
36 / 302
37 / 302
38 / 302
39 / 302
40 / 302
41 / 302
42 / 302
43 / 302
44 / 302
45 / 302
46 / 302
47 / 302
48 / 302
49 / 302
50 / 302
51 / 302
52 / 302
53 / 302
54 / 302
55 / 302
56 / 302
57 / 302
58 / 302
59 / 302
60 / 302
61 / 302
62 / 302
63 / 302
64 / 302
65 / 302
66 / 302
67 / 302
68 / 302
69 / 302
70 / 302
71 / 302
72 / 302
73 / 302
74 / 302
75 / 302
76 / 302
77 / 302
78 / 302
79 / 302
80 / 302
81 / 302
82 / 302
83 / 302
84 / 302
85 / 302
86 / 302
87 / 302
88 / 302
89 / 302
90 / 302
91 / 302
92 / 302
93 / 302
94 / 302
95 / 302
96 / 302
97 / 302
98 / 302
99 / 302
100 / 302
101 / 302
102 / 302
103 / 302
104 / 302
105 / 302
106 / 302
107 / 302
108 / 302
109 / 3

79 / 146
80 / 146
81 / 146
82 / 146
83 / 146
84 / 146
85 / 146
86 / 146
87 / 146
88 / 146
89 / 146
90 / 146
91 / 146
92 / 146
93 / 146
94 / 146
95 / 146
96 / 146
97 / 146
98 / 146
99 / 146
100 / 146
101 / 146
102 / 146
103 / 146
104 / 146
105 / 146
106 / 146
107 / 146
108 / 146
109 / 146
110 / 146
111 / 146
112 / 146
113 / 146
114 / 146
115 / 146
116 / 146
117 / 146
118 / 146
119 / 146
120 / 146
121 / 146
122 / 146
123 / 146
124 / 146
125 / 146
126 / 146
127 / 146
128 / 146
129 / 146
130 / 146
131 / 146
132 / 146
133 / 146
134 / 146
135 / 146
136 / 146
137 / 146
138 / 146
139 / 146
140 / 146
141 / 146
142 / 146
143 / 146
144 / 146
145 / 146
-------------------
elements 280
0 / 280
1 / 280
2 / 280
3 / 280
4 / 280
5 / 280
6 / 280
7 / 280
8 / 280
9 / 280
10 / 280
11 / 280
12 / 280
13 / 280
14 / 280
15 / 280
16 / 280
17 / 280
18 / 280
19 / 280
20 / 280
21 / 280
22 / 280
23 / 280
24 / 280
25 / 280
26 / 280
27 / 280
28 / 280
29 / 280
30 / 280
31 / 280
32 / 280
33 / 280
34 / 280
35 / 280
36 /

182 / 302
183 / 302
184 / 302
185 / 302
186 / 302
187 / 302
188 / 302
189 / 302
190 / 302
191 / 302
192 / 302
193 / 302
194 / 302
195 / 302
196 / 302
197 / 302
198 / 302
199 / 302
200 / 302
201 / 302
202 / 302
203 / 302
204 / 302
205 / 302
206 / 302
207 / 302
208 / 302
209 / 302
210 / 302
211 / 302
212 / 302
213 / 302
214 / 302
215 / 302
216 / 302
217 / 302
218 / 302
219 / 302
220 / 302
221 / 302
222 / 302
223 / 302
224 / 302
225 / 302
226 / 302
227 / 302
228 / 302
229 / 302
230 / 302
231 / 302
232 / 302
233 / 302
234 / 302
235 / 302
236 / 302
237 / 302
238 / 302
239 / 302
240 / 302
241 / 302
242 / 302
243 / 302
244 / 302
245 / 302
246 / 302
247 / 302
248 / 302
249 / 302
250 / 302
251 / 302
252 / 302
253 / 302
254 / 302
255 / 302
256 / 302
257 / 302
258 / 302
259 / 302
260 / 302
261 / 302
262 / 302
263 / 302
264 / 302
265 / 302
266 / 302
267 / 302
268 / 302
269 / 302
270 / 302
271 / 302
272 / 302
273 / 302
274 / 302
275 / 302
276 / 302
277 / 302
278 / 302
279 / 302
280 / 302
281 / 302


elements 3
0 / 3
1 / 3
2 / 3
-------------------
elements 14
0 / 14
1 / 14
2 / 14
3 / 14
4 / 14
5 / 14
6 / 14
7 / 14
8 / 14
9 / 14
10 / 14
11 / 14
12 / 14
13 / 14
-------------------
elements 288
0 / 288
1 / 288
2 / 288
3 / 288
4 / 288
5 / 288
6 / 288
7 / 288
8 / 288
9 / 288
10 / 288
11 / 288
12 / 288
13 / 288
14 / 288
15 / 288
16 / 288
17 / 288
18 / 288
19 / 288
20 / 288
21 / 288
22 / 288
23 / 288
24 / 288
25 / 288
26 / 288
27 / 288
28 / 288
29 / 288
30 / 288
31 / 288
32 / 288
33 / 288
34 / 288
35 / 288
36 / 288
37 / 288
38 / 288
39 / 288
40 / 288
41 / 288
42 / 288
43 / 288
44 / 288
45 / 288
46 / 288
47 / 288
48 / 288
49 / 288
50 / 288
51 / 288
52 / 288
53 / 288
54 / 288
55 / 288
56 / 288
57 / 288
58 / 288
59 / 288
60 / 288
61 / 288
62 / 288
63 / 288
64 / 288
65 / 288
66 / 288
67 / 288
68 / 288
69 / 288
70 / 288
71 / 288
72 / 288
73 / 288
74 / 288
75 / 288
76 / 288
77 / 288
78 / 288
79 / 288
80 / 288
81 / 288
82 / 288
83 / 288
84 / 288
85 / 288
86 / 288
87 / 288
88 / 288
89 / 288
90 /

5 / 16
6 / 16
7 / 16
8 / 16
9 / 16
10 / 16
11 / 16
12 / 16
13 / 16
14 / 16
15 / 16
-------------------
elements 39
0 / 39
1 / 39
2 / 39
3 / 39
4 / 39
5 / 39
6 / 39
7 / 39
8 / 39
9 / 39
10 / 39
11 / 39
12 / 39
13 / 39
14 / 39
15 / 39
16 / 39
17 / 39
18 / 39
19 / 39
20 / 39
21 / 39
22 / 39
23 / 39
24 / 39
25 / 39
26 / 39
27 / 39
28 / 39
29 / 39
30 / 39
31 / 39
32 / 39
33 / 39
34 / 39
35 / 39
36 / 39
37 / 39
38 / 39
term Spring 2019
-------------------
elements 37
0 / 37
1 / 37
2 / 37
3 / 37
4 / 37
5 / 37
6 / 37
7 / 37
8 / 37
9 / 37
10 / 37
11 / 37
12 / 37
13 / 37
14 / 37
15 / 37
16 / 37
17 / 37
18 / 37
19 / 37
20 / 37
21 / 37
22 / 37
23 / 37
24 / 37
25 / 37
26 / 37
27 / 37
28 / 37
29 / 37
30 / 37
31 / 37
32 / 37
33 / 37
34 / 37
35 / 37
36 / 37
-------------------
elements 163
0 / 163
1 / 163
2 / 163
3 / 163
4 / 163
5 / 163
6 / 163
7 / 163
8 / 163
9 / 163
10 / 163
11 / 163
12 / 163
13 / 163
14 / 163
15 / 163
16 / 163
17 / 163
18 / 163
19 / 163
20 / 163
21 / 163
22 / 163
23 / 163
24 / 163


28 / 302
29 / 302
30 / 302
31 / 302
32 / 302
33 / 302
34 / 302
35 / 302
36 / 302
37 / 302
38 / 302
39 / 302
40 / 302
41 / 302
42 / 302
43 / 302
44 / 302
45 / 302
46 / 302
47 / 302
48 / 302
49 / 302
50 / 302
51 / 302
52 / 302
53 / 302
54 / 302
55 / 302
56 / 302
57 / 302
58 / 302
59 / 302
60 / 302
61 / 302
62 / 302
63 / 302
64 / 302
65 / 302
66 / 302
67 / 302
68 / 302
69 / 302
70 / 302
71 / 302
72 / 302
73 / 302
74 / 302
75 / 302
76 / 302
77 / 302
78 / 302
79 / 302
80 / 302
81 / 302
82 / 302
83 / 302
84 / 302
85 / 302
86 / 302
87 / 302
88 / 302
89 / 302
90 / 302
91 / 302
92 / 302
93 / 302
94 / 302
95 / 302
96 / 302
97 / 302
98 / 302
99 / 302
100 / 302
101 / 302
102 / 302
103 / 302
104 / 302
105 / 302
106 / 302
107 / 302
108 / 302
109 / 302
110 / 302
111 / 302
112 / 302
113 / 302
114 / 302
115 / 302
116 / 302
117 / 302
118 / 302
119 / 302
120 / 302
121 / 302
122 / 302
123 / 302
124 / 302
125 / 302
126 / 302
127 / 302
128 / 302
129 / 302
130 / 302
131 / 302
132 / 302
133 / 302
134 / 302
13

64 / 262
65 / 262
66 / 262
67 / 262
68 / 262
69 / 262
70 / 262
71 / 262
72 / 262
73 / 262
74 / 262
75 / 262
76 / 262
77 / 262
78 / 262
79 / 262
80 / 262
81 / 262
82 / 262
83 / 262
84 / 262
85 / 262
86 / 262
87 / 262
88 / 262
89 / 262
90 / 262
91 / 262
92 / 262
93 / 262
94 / 262
95 / 262
96 / 262
97 / 262
98 / 262
99 / 262
100 / 262
101 / 262
102 / 262
103 / 262
104 / 262
105 / 262
106 / 262
107 / 262
108 / 262
109 / 262
110 / 262
111 / 262
112 / 262
113 / 262
114 / 262
115 / 262
116 / 262
117 / 262
118 / 262
119 / 262
120 / 262
121 / 262
122 / 262
123 / 262
124 / 262
125 / 262
126 / 262
127 / 262
128 / 262
129 / 262
130 / 262
131 / 262
132 / 262
133 / 262
134 / 262
135 / 262
136 / 262
137 / 262
138 / 262
139 / 262
140 / 262
141 / 262
142 / 262
143 / 262
144 / 262
145 / 262
146 / 262
147 / 262
148 / 262
149 / 262
150 / 262
151 / 262
152 / 262
153 / 262
154 / 262
155 / 262
156 / 262
157 / 262
158 / 262
159 / 262
160 / 262
161 / 262
162 / 262
163 / 262
164 / 262
165 / 262
166 / 262
167 / 

63 / 302
64 / 302
65 / 302
66 / 302
67 / 302
68 / 302
69 / 302
70 / 302
71 / 302
72 / 302
73 / 302
74 / 302
75 / 302
76 / 302
77 / 302
78 / 302
79 / 302
80 / 302
81 / 302
82 / 302
83 / 302
84 / 302
85 / 302
86 / 302
87 / 302
88 / 302
89 / 302
90 / 302
91 / 302
92 / 302
93 / 302
94 / 302
95 / 302
96 / 302
97 / 302
98 / 302
99 / 302
100 / 302
101 / 302
102 / 302
103 / 302
104 / 302
105 / 302
106 / 302
107 / 302
108 / 302
109 / 302
110 / 302
111 / 302
112 / 302
113 / 302
114 / 302
115 / 302
116 / 302
117 / 302
118 / 302
119 / 302
120 / 302
121 / 302
122 / 302
123 / 302
124 / 302
125 / 302
126 / 302
127 / 302
128 / 302
129 / 302
130 / 302
131 / 302
132 / 302
133 / 302
134 / 302
135 / 302
136 / 302
137 / 302
138 / 302
139 / 302
140 / 302
141 / 302
142 / 302
143 / 302
144 / 302
145 / 302
146 / 302
147 / 302
148 / 302
149 / 302
150 / 302
151 / 302
152 / 302
153 / 302
154 / 302
155 / 302
156 / 302
157 / 302
158 / 302
159 / 302
160 / 302
161 / 302
162 / 302
163 / 302
164 / 302
165 / 302
166 / 3

In [26]:
columbia = pd.DataFrame(columbia_list)

#title only
# for word in normative:
#     columbia_df = columbia[columbia['title'].str.contains(word, flags = re.IGNORECASE)]

columbia

Unnamed: 0,keyword,term,title,university
0,account,Fall 2019,ACCOUNTING AND FINANCE,columbia university
1,account,Fall 2019,FINANCIAL ACCOUNTING,columbia university
2,account,Fall 2019,FINANCIAL ACCOUNTING,columbia university
3,account,Fall 2019,ACCOUNTING & BUDGETING,columbia university
4,account,Fall 2019,BUDGETING/ACCOUNTING FILMMAKERS,columbia university
5,account,Fall 2019,HLTH CARE ACCOUNTING&BUDGETING,columbia university
6,account,Fall 2019,HUMAN RIGHTS ACCOUNTABILITY & REMEDIES,columbia university
7,account,Fall 2019,ACCOUNTABILTY-ETHIC IN HUMANITARIAN,columbia university
8,account,Fall 2019,ACCOUNTING FOR THEATRE,columbia university
9,account,Fall 2019,FIN & ACCOUNTNG IN CONSTRUCTION INDUSTRY,columbia university


Now that we've extracted all courses containing a normative keyword of interest, we need to filter our courses to only return titles that contain a normative AND a technical keyword. This is the case for all words except instances of our preprocessed `privac` and `secur`, for which we want to return all courses, even if they don't contain two keywords. To do this, we'll split the courses into two data frames, apply our respective conditions, and then merge them back together. 

In [27]:
exceptions = columbia.loc[(columbia['keyword']=='privac') | (columbia['keyword'] =='secur')]
exceptions

Unnamed: 0,keyword,term,title,university
358,secur,Fall 2019,NAT SECURITY STRAT OF MID EAST,columbia university
359,secur,Fall 2019,SECURITY ANALYSIS,columbia university
360,secur,Fall 2019,SECURITY II,columbia university
361,secur,Fall 2019,SECURITIES REGULATION,columbia university
362,secur,Fall 2019,INT'L SECURITIES REGULATION,columbia university
363,secur,Fall 2019,RE DEBT SECURITIZATION,columbia university
364,secur,Fall 2019,ENTERPR INFO SECURITY: THREATS & DEFENSE,columbia university
530,secur,Summer 2019,SECURITY ANALYSIS,columbia university
531,secur,Summer 2019,EAST ASIAN SECURITY,columbia university
532,secur,Summer 2019,ENTERPR INFO SECURITY: THREATS & DEFENSE,columbia university


In [28]:
#loop through technical keyword list, extract relevant titles
for word in technical:
    df = columbia[columbia['title'].str.contains(word, flags = re.IGNORECASE)]
    df['keyword2'] = word
    
#join keyword cols
df["keyword"] = df["keyword"].map(str) + "," + df["keyword2"]
df = df.drop(columns="keyword2")

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


Unnamed: 0,keyword,term,title,university
325,"power,technolog",Fall 2019,TECHNOLOGY AND POWER IN MODERN CHINA,columbia university
343,"religi,technolog",Fall 2019,"TECHNOLOGY,RELIGION,FUTURE",columbia university
464,"polic,technolog",Fall 2019,"INTERNET TECHNOLOGY,ECONOMICS,AND POLICY",columbia university
755,"law,technolog",Spring 2019,BIOTECHNOLOGY LAW,columbia university


NOTE: the above cell is likely not the best nor most simple way to execute this step! Feel free to take special liberties here. It's probably wise to pick out a few titles that you know should be returned manually, then check to see if the script is working as desired. 

In [29]:
#combine dfs 
columbia = pd.concat([df, exceptions])
columbia

Unnamed: 0,keyword,term,title,university
325,"power,technolog",Fall 2019,TECHNOLOGY AND POWER IN MODERN CHINA,columbia university
343,"religi,technolog",Fall 2019,"TECHNOLOGY,RELIGION,FUTURE",columbia university
464,"polic,technolog",Fall 2019,"INTERNET TECHNOLOGY,ECONOMICS,AND POLICY",columbia university
755,"law,technolog",Spring 2019,BIOTECHNOLOGY LAW,columbia university
358,secur,Fall 2019,NAT SECURITY STRAT OF MID EAST,columbia university
359,secur,Fall 2019,SECURITY ANALYSIS,columbia university
360,secur,Fall 2019,SECURITY II,columbia university
361,secur,Fall 2019,SECURITIES REGULATION,columbia university
362,secur,Fall 2019,INT'L SECURITIES REGULATION,columbia university
363,secur,Fall 2019,RE DEBT SECURITIZATION,columbia university


Lastly, we want to export our csv. Ideally, all csv files should be written to the courses directory in our repository. 

In [30]:
#export as csv
columbia.to_csv('../courses/columbia.csv')