# Courses Demo
This Jupyter notebook explores the data set courses20-21.json,
which consists of all Brandeis courses in the 20-21 academic year (Fall20, Spr21, Sum21)
that had at least one student enrolled.

First we need to read the JSON file into a list of Python dictionaries.

In [1]:
import json
import statistics

In [2]:
with open("courses20-21.json","r",encoding='utf-8') as jsonfile:
    courses = json.load(jsonfile)

## Structure of a course
Next we look at the fields of a course dictionary and their values.

In [3]:
print('there are',len(courses),'courses in the dataset')
print('here is the data for course 1246')
courses[1246]

there are 7813 courses in the dataset
here is the data for course 1246


{'limit': 28,
 'times': [{'start': 1080, 'end': 1170, 'days': ['w', 'm']}],
 'enrolled': 4,
 'details': 'Instruction for this course will be offered remotely. Meeting times for this course are listed in the schedule of classes (in ET).',
 'type': 'section',
 'status_text': 'Open',
 'section': '1',
 'waiting': 0,
 'instructor': ['An', 'Huang', 'anhuang@brandeis.edu'],
 'coinstructors': [],
 'code': ['MATH', '223A'],
 'subject': 'MATH',
 'coursenum': '223A',
 'name': 'Lie Algebras: Representation Theory',
 'independent_study': False,
 'term': '1203',
 'description': "Theorems of Engel and Lie. Semisimple Lie algebras, Cartan's criterion. Universal enveloping algebras, PBW theorem, Serre's construction. Representation theory. Other topics as time permits. Usually offered every second year.\nAn Huang"}
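
The `start` and `end` values in `times` look like minutes since midnight (1080 would be 18:00 and 1170 would be 19:30), but the data set does not say so; treat that reading as an assumption. Under that assumption, a minimal sketch for making a meeting time readable, using a hypothetical helper `format_minutes`:

```python
def format_minutes(minutes):
    """Convert minutes since midnight to an HH:MM string (assumed encoding)."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"

# Meeting block copied from the course record shown above.
meeting = {'start': 1080, 'end': 1170, 'days': ['w', 'm']}

start_text = format_minutes(meeting['start'])
end_text = format_minutes(meeting['end'])
duration = meeting['end'] - meeting['start']  # length of the meeting in minutes

print(start_text, end_text, duration)
```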

## Cleaning the data
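Lists cannot be placed in a set or used as dictionary keys because they are mutable and therefore unhashable. Since we want to count distinct instructors and group courses by instructor or by code, we replace the lists with tuples (which are immutable sequences).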

In [4]:
for course in courses:
    course['instructor'] = tuple(course['instructor'])
    course['coinstructors'] = tuple(tuple(f) for f in course['coinstructors'])
    course['code'] = tuple(course['code'])
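
One practical effect of this conversion is that tuples are hashable, so they can be stored in sets and used as dictionary keys, while lists cannot. A small sketch of the difference, using the instructor value from the record above:

```python
instructor_list = ['An', 'Huang', 'anhuang@brandeis.edu']

# A set has to hash its elements, so building one from a list element fails.
try:
    {instructor_list}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# The tuple version hashes fine, and duplicates collapse to one element.
instructor_tuple = tuple(instructor_list)
distinct = {instructor_tuple, instructor_tuple}

print(list_is_hashable, len(distinct))
```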

In [5]:
print('notice that the instructor and code are tuples now')
courses[1246]

notice that the instructor and code are tuples now


{'limit': 28,
 'times': [{'start': 1080, 'end': 1170, 'days': ['w', 'm']}],
 'enrolled': 4,
 'details': 'Instruction for this course will be offered remotely. Meeting times for this course are listed in the schedule of classes (in ET).',
 'type': 'section',
 'status_text': 'Open',
 'section': '1',
 'waiting': 0,
 'instructor': ('An', 'Huang', 'anhuang@brandeis.edu'),
 'coinstructors': (),
 'code': ('MATH', '223A'),
 'subject': 'MATH',
 'coursenum': '223A',
 'name': 'Lie Algebras: Representation Theory',
 'independent_study': False,
 'term': '1203',
 'description': "Theorems of Engel and Lie. Semisimple Lie algebras, Cartan's criterion. Universal enveloping algebras, PBW theorem, Serre's construction. Representation theory. Other topics as time permits. Usually offered every second year.\nAn Huang"}

## Exploring the data set
Now we will show how to use plain Python to explore the data set and answer some interesting questions. Next week we will start learning Pandas and NumPy, which are packages that make it easier to explore large data sets efficiently.

Here are some questions we can try to answer:
* what are all of the subjects of courses (e.g. COSI, MATH, JAPN, PHIL, ...)?
* which terms are represented?
* how many instructors taught at Brandeis last year?
* what were the five largest course sections?
* what were the five largest courses (where we combine sections)?
* which are the five largest subjects measured by number of courses offered?
* which are the five largest courses measured by number of students taught?
* which course had the most sections taught in 20-21?
* who are the top five faculty in terms of number of students taught?
* etc.
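
The first two questions in the list only need the distinct values of one field, which a set comprehension gives directly. A minimal sketch on a made-up three-record sample (the records and the term `'1211'` are invented for illustration; with the real data, use `courses` instead of `sample_courses`):

```python
# Made-up sample records; only the fields used here are included.
sample_courses = [
    {'subject': 'MATH', 'term': '1203'},
    {'subject': 'COSI', 'term': '1203'},
    {'subject': 'MATH', 'term': '1211'},
]

# A set keeps each value once; sorted() turns it into an ordered list.
subjects = sorted({c['subject'] for c in sample_courses})
terms = sorted({c['term'] for c in sample_courses})

print(subjects)
print(terms)
```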

In [6]:
#A. how many faculty taught COSI courses last year

len({course['instructor'] for course in courses if course['subject']=='COSI'})

27

In [7]:
#B. what is the total number of students taking COSI courses last year?
#   (use a generator, not a set: a set would drop sections that share the same enrollment value)

sum(course['enrolled'] for course in courses if course['subject']=='COSI')

2223

In [9]:
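#C. what was the median size of a COSI course last year (counting only those courses with at least 10 students)?
#   Note: the set comprehension keeps only distinct enrollment values, so the result below is the median
#   of the distinct section sizes; a list comprehension would count every section.

statistics.median({course['enrolled'] for course in courses if course['subject']=='COSI' and course['enrolled'] >= 10})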

45.5
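
When aggregating enrollment numbers, the choice between a set and a list matters, because a set keeps each distinct value only once. A small sketch with made-up section sizes shows how both the sum and the median can change:

```python
import statistics

# Made-up section sizes; three sections happen to have the same enrollment.
sizes = [10, 10, 10, 40, 50]

total_all = sum(sizes)            # counts every section
total_distinct = sum(set(sizes))  # counts each distinct size once

median_all = statistics.median(sizes)
median_distinct = statistics.median(set(sizes))

print(total_all, total_distinct)
print(median_all, median_distinct)
```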

In [10]:
#D. create a list of tuples (S,E) where S is a subject and E is the number of students enrolled in courses in that
#   subject, sort it by E and print the top 10. This shows the top 10 subjects in terms of number of students taught.

subject = [course['subject'] for course in courses]

# initialize every subject's total to 0
new_dict = dict.fromkeys(subject, 0)

# add each course's enrollment to its subject's total
for course in courses:
    new_dict[course['subject']] += course['enrolled']

# turn the dictionary into a list of (S, E) tuples
students_in_subjects = list(new_dict.items())

# sort by the number of students enrolled, largest first, and keep the top 10
top_10 = sorted(students_in_subjects, key = lambda x: x[1], reverse = True)[:10]

for pair in top_10:
    print(pair)

('HS', 5318)
('BIOL', 3085)
('BUS', 2766)
('HWL', 2734)
('CHEM', 2322)
('ECON', 2315)
('COSI', 2223)
('MATH', 1785)
('PSYC', 1704)
('ANTH', 1144)
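
The same per-subject totals can also be accumulated with `collections.Counter` from the standard library, which starts every key at 0 and offers `most_common(n)` for the sorted top entries. This is an alternative sketch on made-up records, not the method used in the cell above:

```python
from collections import Counter

# Made-up sample records; only the fields used here are included.
sample_courses = [
    {'subject': 'COSI', 'enrolled': 40},
    {'subject': 'MATH', 'enrolled': 25},
    {'subject': 'COSI', 'enrolled': 10},
    {'subject': 'PHIL', 'enrolled': 5},
]

# Missing keys count as 0, so no initialization step is needed.
totals = Counter()
for c in sample_courses:
    totals[c['subject']] += c['enrolled']

# most_common(n) returns the n largest (key, total) pairs, largest first.
top_2 = totals.most_common(2)
print(top_2)
```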


In [11]:
#E. do the same as in (D) but print the top 10 subjects in terms of number of courses offered

subject = [course['subject'] for course in courses]

new_dict = dict.fromkeys(subject, 0)

# record the number of courses offered in each subject
for course in courses:
    new_dict[course['subject']] += 1

courses_in_subjects = list(new_dict.items())

# sort by the number of courses offered, largest first, and keep the top 10
top_10 = sorted(courses_in_subjects, key = lambda x: x[1], reverse = True)[:10]

for pair in top_10:
    print(pair)

('BIOL', 613)
('HIST', 498)
('PSYC', 417)
('NEUR', 403)
('BCHM', 296)
('PHYS', 288)
('HS', 274)
('COSI', 272)
('MUS', 266)
('ENG', 265)


In [19]:
#F. do the same as (D) but print the top 10 subjects in terms of number of faculty teaching courses in that subject

subject = [course['subject'] for course in courses]

new_dict = dict.fromkeys(subject, 0)

# record the number of distinct instructors teaching courses in each subject
for subject_item in new_dict:
    new_dict[subject_item] = len({course['instructor'] for course in courses if course['subject']==subject_item})

faculty_in_subjects = list(new_dict.items())

# sort by the number of faculty teaching in the subject, largest first, and keep the top 10
top_10 = sorted(faculty_in_subjects, key = lambda x: x[1], reverse = True)[:10]

for pair in top_10:
    print(pair)

('HS', 87)
('BIOL', 67)
('ECON', 52)
('BCHM', 49)
('BUS', 47)
('HIST', 47)
('BCBP', 46)
('HWL', 42)
('MATH', 37)
('NEJS', 37)
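
The cell above rescans the whole course list once per subject. A possible single-pass alternative collects each subject's instructors in a `defaultdict(set)`; the sketch below uses made-up records and placeholder instructor tuples:

```python
from collections import defaultdict

# Made-up sample records with placeholder instructors.
sample_courses = [
    {'subject': 'COSI', 'instructor': ('A', 'One', 'a@example.edu')},
    {'subject': 'COSI', 'instructor': ('B', 'Two', 'b@example.edu')},
    {'subject': 'COSI', 'instructor': ('A', 'One', 'a@example.edu')},
    {'subject': 'MATH', 'instructor': ('C', 'Three', 'c@example.edu')},
]

# Each subject maps to the set of distinct instructors seen for it.
faculty_by_subject = defaultdict(set)
for c in sample_courses:
    faculty_by_subject[c['subject']].add(c['instructor'])

# Convert to (subject, faculty count) pairs, largest count first.
faculty_counts = sorted(
    ((subject, len(faculty)) for subject, faculty in faculty_by_subject.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(faculty_counts)
```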


In [20]:
#G. list the top 20 faculty in terms of number of students they taught

instructor = (course['instructor'] for course in courses) 

new_dict = dict.fromkeys(instructor, 0)

for course in courses:
    new_dict[course['instructor']] += course['enrolled'] 

top_20 = sorted(new_dict.items(), key = lambda x: x[1], reverse = True)[:20]

for course in top_20:
    print (course[0])

('Leah', 'Berkenwald', 'leahb@brandeis.edu')
('Kene Nathan', 'Piasta', 'kpiasta@brandeis.edu')
('Stephanie', 'Murray', 'murray@brandeis.edu')
('Milos', 'Dolnik', 'dolnik@brandeis.edu')
('Maria', 'de Boef Miara', 'mmiara@brandeis.edu')
('Bryan', 'Ingoglia', 'ingoglia@brandeis.edu')
('Rachel V.E.', 'Woodruff', 'woodruff@brandeis.edu')
('Timothy J', 'Hickey', 'tjhickey@brandeis.edu')
('Daniel', 'Breen', 'dbreen91@brandeis.edu')
('Melissa', 'Kosinski-Collins', 'kosinski@brandeis.edu')
('Claudia', 'Novack', 'novack@brandeis.edu')
('Antonella', 'DiLillo', 'dilant@brandeis.edu')
('Jon', 'Chilingerian', 'chilinge@brandeis.edu')
('Ahmad', 'Namini', 'anamini@brandeis.edu')
('Iraklis', 'Tsekourakis', 'tsekourakis@brandeis.edu')
('Geoffrey', 'Clarke', 'geoffclarke@brandeis.edu')
('Peter', 'Mistark', 'pmistark@brandeis.edu')
('Brenda', 'Anderson', 'banders@brandeis.edu')
('Colleen', 'Hitchcock', 'hitchcock@brandeis.edu')
('Scott A.', 'Redenius', 'redenius@brandeis.edu')


In [21]:
#H. list the top 20 courses in terms of number of students taking that course (where you
#   combine different sections and semesters, i.e. just use the subject and course number)

code = (course['code'] for course in courses)

new_dict = dict.fromkeys(code, 0)

for course in courses:
    new_dict[course['code']] += course['enrolled']

top_20 = sorted(new_dict.items(), key = lambda x: x[1], reverse = True)[:20]

for course in top_20:
    print(course[0])

('HWL', '1')
('HWL', '1-PRE')
('BIOL', '14A')
('COSI', '10A')
('PSYC', '10A')
('BIOL', '15B')
('MATH', '10A')
('BIOL', '18B')
('BIOL', '18A')
('CHEM', '29A')
('CHEM', '29B')
('CHEM', '25A')
('PSYC', '51A')
('CHEM', '25B')
('COSI', '12B')
('BUS', '6A')
('CHEM', '18A')
('ECON', '10A')
('MATH', '15A')
('ANTH', '1A')
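
Questions G and H follow the same pattern: group records by one field, total another field, and keep the largest groups. That pattern could be factored into one function. The helper name `top_n_by_total` and the sample records below are hypothetical, and unlike the cells above the sketch returns the totals along with the keys:

```python
def top_n_by_total(records, key_field, value_field, n):
    """Return the n (key, total) pairs with the largest totals, largest first."""
    totals = {}
    for record in records:
        key = record[key_field]
        totals[key] = totals.get(key, 0) + record[value_field]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Made-up sample records with placeholder instructors.
sample_courses = [
    {'code': ('COSI', '10A'), 'instructor': ('Ada', 'Example', 'ada@example.edu'), 'enrolled': 60},
    {'code': ('MATH', '10A'), 'instructor': ('Bo', 'Sample', 'bo@example.edu'), 'enrolled': 35},
    {'code': ('COSI', '10A'), 'instructor': ('Bo', 'Sample', 'bo@example.edu'), 'enrolled': 20},
]

top_codes = top_n_by_total(sample_courses, 'code', 'enrolled', 2)
top_faculty = top_n_by_total(sample_courses, 'instructor', 'enrolled', 1)
print(top_codes)
print(top_faculty)
```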


In [22]:
#I. Create your own interesting question (each team member creates their own) and use Python
#   to answer that question.
#
#   The top 10 course sections measured by the number of students on the waiting list -- Bohan

top_10 = sorted(courses, key = lambda x: x['waiting'], reverse = True)[:10]
for course in top_10:
    print(course['code'])

('BIOL', '51A')
('NPSY', '22B')
('HWL', '14')
('HWL', '12')
('BIOL', '51A')
('MATH', '8A')
('BIOL', '43B')
('BIOL', '43B')
('PHIL', '23B')
('BIOL', '43B')
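
Because the ranking above is over individual sections, the same course code can appear more than once. Printing the section, term, and waiting count next to the code makes the repeated entries distinguishable. A sketch on made-up records (all values below are invented for illustration):

```python
# Made-up sample sections; the waiting counts are invented.
sample_sections = [
    {'code': ('BIOL', '51A'), 'section': '1', 'term': '1203', 'waiting': 12},
    {'code': ('PHIL', '23B'), 'section': '1', 'term': '1203', 'waiting': 4},
    {'code': ('BIOL', '51A'), 'section': '2', 'term': '1203', 'waiting': 9},
]

# Rank sections by waiting-list size, largest first.
ranked = sorted(sample_sections, key=lambda s: s['waiting'], reverse=True)

# Build one readable label per section for the top 2.
labels = [
    f"{' '.join(s['code'])} sec {s['section']} term {s['term']}: {s['waiting']}"
    for s in ranked[:2]
]
for label in labels:
    print(label)
```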
