# The COURSPACE Recommendation Engine

## The Data - manual vector encoding
- We used NC State's course catalog to record 11 required courses for each of the ten chosen majors
- Each course was then encoded as a vector based on its subject area (e.g., humanities, physics, engineering)

## The Input - from an intuitive user interface
- Students enter the courses they took in high school **along with a rating of how much they enjoyed each course**
    - The rating acts as a weight when assessing the similarity between a high school course vector and a college course vector
- These courses are encoded as vectors in the same way as the college major data

## The Algorithm - weighted cosine similarity evaluation
- For each major in our minimal dataset, we compute the weighted cosine similarity between every input high school course vector and every college course vector in that major. Cosine similarity is the cosine of the angle between two vectors, so it is 1 when they point in the same direction and 0 when they are orthogonal:

![](cos_sim_demo.png)

- For each major, the pairwise cosine similarity scores against the entered high school courses are summed and then normalized by the maximum score across all majors. The three majors with the highest normalized scores are recommended to the student
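
The demo code later in this notebook does not apply the enjoyment ratings, so here is a minimal sketch of how a rating-weighted score could be computed. It assumes ratings on a 1 to 5 scale and that each pairwise similarity is multiplied by the rating of the high school course involved; the names `cosine` and `weighted_major_score`, the rating scale, and the toy 3-category vectors are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def weighted_major_score(hs_courses, ratings, major_courses):
    """Rating-weighted mean of all pairwise cosine similarities.

    hs_courses:    list of high school course vectors
    ratings:       one enjoyment rating per high school course (assumed 1-5)
    major_courses: list of college course vectors for one major
    """
    total = 0.0
    for hs_vec, rating in zip(hs_courses, ratings):
        for college_vec in major_courses:
            total += rating * cosine(hs_vec, college_vec)
    # Dividing by the summed weights keeps the score in [0, 1] for non-negative vectors
    return total / (sum(ratings) * len(major_courses))

# Hypothetical input: a well-liked math course and a disliked humanities course
hs = [[0, 0, 1], [0, 1, 0]]
ratings = [5, 1]
math_major = [[0, 0, 1], [0, 0, 1]]
humanities_major = [[0, 1, 0], [0, 1, 0]]

math_score = weighted_major_score(hs, ratings, math_major)
humanities_score = weighted_major_score(hs, ratings, humanities_major)
print(math_score, humanities_score)
```

With these toy values the math major scores higher than the humanities major, because the highly rated course contributes five times as much as the poorly rated one.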


In [483]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import numpy as np

### Now, let's get started with some demo code for this engine

In [484]:
# The cosine similarity between two identical vectors is 1
cosine_similarity([[1,1]], [[1,1]])

array([[1.]])

In [485]:
# The cosine similarity between two orthogonal vectors is 0
cosine_similarity([[0,1]], [[1,0]])

array([[0.]])
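
For intuition, the same quantity can be computed by hand as the dot product divided by the product of the two vector lengths. The sketch below uses only the standard library; `cosine_by_hand` is a name introduced here for illustration.

```python
import math

def cosine_by_hand(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_by_hand([1, 1], [1, 1]))   # identical direction
print(cosine_by_hand([0, 1], [1, 0]))   # orthogonal
print(cosine_by_hand([1, 1], [3, 3]))   # same direction, different length
print(cosine_by_hand([1, 1], [1, 0]))   # 45 degrees apart
```

The four results are approximately 1, 0, 1, and 0.707. The third case shows that only direction matters: scaling a vector does not change its cosine similarity to another vector.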

In [486]:
# variables for constructing feature vectors
# the fixed list of subject categories is what allows us to encode a course as a vector:
# each position in the vector corresponds to one category
COURSE_CATEGORIES = ["social science", "humanities", "mathematics", "physics",
                     "chemistry", "environmental science",
                     "biology", "engineering", "computer science"]
vector_template = [0] * len(COURSE_CATEGORIES)

# we are using cosine similarity,
# but euclidean distance is also an option
distance_method = 1 # 0 for euclidean distance, 1 for cosine similarity
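
The two options rank in opposite directions: a larger cosine similarity means a closer match, while a smaller euclidean distance means a closer match. The following sketch illustrates this on three vectors built with the nine categories above; the helper names `cos_sim` and `euclid` are introduced here for illustration and use plain NumPy rather than the scikit-learn functions imported earlier.

```python
import numpy as np

physics_math = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0])   # "Physics, Mathematics"
mathematics  = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])   # "Mathematics"
humanities   = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0])   # "Humanities"

def cos_sim(u, v):
    """Cosine similarity: higher means more similar."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclid(u, v):
    """Euclidean distance: lower means more similar."""
    return float(np.linalg.norm(u - v))

for name, other in [("mathematics", mathematics), ("humanities", humanities)]:
    print(name, cos_sim(physics_math, other), euclid(physics_math, other))
```

The related pair (physics/mathematics vs. mathematics) gets the higher similarity and the lower distance, which is why the sort order has to be flipped when switching methods.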

In [487]:
# Load original data (taken from NC State)
df1 = pd.read_excel("majors.xlsx",usecols = lambda col_name: ("vector" not in col_name.lower()))
df1

Unnamed: 0,Mechanical Engineering,Chemistry,English,Biology,Physics,Computer-Science,Education-Studies,Mathematics,History,Enviromental Engineering
0,CH 101 - Chemistry - A Molecular Science (3 un...,General Chemistry,American Literature (3 units: C- or better),Communication and Writing,Advanced Writing,CH 102 - General Chemistry Laboratory (1 units),ED 100 - Intro to Education (2 units),MA 402 - Mathematics of Scientific Computing (...,Pre 1600 History,BIO 183 - Introductory Biology: Cellular and M...
1,CH 102 - General Chemistry Laboratory (1 units...,Advanced Writing (3 units),British Literature (3 units: C- or better),LSC 101 - Critical and Creative Thinking in th...,PY 252 - Instrumental and Data Analysis for Ph...,PY 205 - Physics for Engineers and Scientists ...,ELP 344 - School and Society (3 units),MA 405 - Introduction to Linear Algebra (3 uni...,History I & History Breadth (6 units: C- or be...,CE 373 - Fundamentals of Environmental Enginee...
2,E 101 - Introduction to Engineering & Problem ...,Organic Chemistry,World Literature (3 units: C- or better),"BIO 181 - Introductory Biology: Ecology, Evolu...",PY 401 - Quantum Physics I (3 units: C- or bet...,MA 305 - Introductory Linear Algebra and Matri...,EDP 304 - Educational Psychology (3 units),MA 407 - Introduction to Modern Algebra for Ma...,History II & History Breadth (6 units: C- or b...,CE 250 - Introduction to Sustainable Infrastru...
3,E 115 - Introduction to Computing Environments...,Quantitative Chemistry,Film (3 units: C- or better),BIO 183 - Introductory Biology: Cellular and M...,PY 402 - Quantum Physics II (3 units: C- or be...,ST 370 - Probability and Statistics for Engine...,PSY 376 - Developmental Psychology (3 units),MA 425 - Mathematical Analysis I (3 units: C- ...,HI 300 - History Methods and Writing (3 units:...,CHE 205 - Chemical Process Principles (4 units)
4,MA 141 - Calculus I (4 units: C or better),MA 241 - Calculus II (4 units: C- or better),Linguistics (3 units: C- or better),CH 101 - Chemistry - A Molecular Science (3 un...,PY 411 - Mechanics I (3 units: C- or better),CSC 116 - Introduction to Computing - Java (3 ...,PSY 200 - Introduction to Psychology (3 units),Methods of Applied Math (3 units: C- or better),HI 4** (9 units: C- or better),CE 378 - Environmental Chemistry and Microbiol...
5,ENG 101 Acad Writing Research (4 units: C- or ...,MA 242 - Calculus III (4 units: C- or better),Rhetoric (3 units: C- or better),Genetics,PY 412 - Mechanics II (3 units: C- or better),CSC 216 - Software Development Fundamentals (3...,SOC 202 - Principles of Sociology (3 units),Math Electives (9 units: C- or better),Department Electives (3 units: C- or better),CH 201 - Chemistry - A Quantitative Science (3...
6,Economics (3 units),MA 341 - Applied Differential Equations I (3 u...,History I,Organic Chemistry & Lab (4 units),MA 405 - Introduction to Linear Algebra (3 uni...,Calculus,History (3 units),MA 242 - Calculus III (4 units: C- or better),HI 491 - Seminar in History (3 units: C- or be...,Economics (3 units)
7,CSC 113 - Introduction to Computing - MATLAB (...,BCH 451 - Principles of Biochemistry (4 units:...,Philosophy,Calculus II (3 units: C- or better),Statistics (3 units: C- or better),CSC 226 - Discrete Mathematics for Computer Sc...,Humanities Elective (3 units),MA 225 - Foundations of Advanced Mathematics (...,History of the Civil Rights Movement,MA 241 - Calculus II (4 units: C or better)
8,MA 241 - Calculus II (4 units: C or better),CH 401 - Systematic Inorganic Chemistry I (3 u...,Literature Elective,PY 131 - Conceptual Physics (4 units),PY 251 - Introduction to Scientific Computing ...,CSC 230 - C and Software Tools (0 units),Calculus I,MA 341 - Applied Differential Equations I (3 u...,AFS 373 - African American History Since 1865,PY 205 & 206 (4 units: C or better)
9,MSE 200 - Mechanical Properties of Structural ...,CH 415 - Analytical Chemistry II (3 units: C- ...,ENG 494 - Special Topics in Linguistics (3 uni...,MB 351 - General Microbiology (3 units: C- or ...,Social Sciences Elective (6 units),Computing / Numerical Methods (3 units: C- or ...,Communication (3 units),General Chemistry,From Renaissance to Revolution: The Origins of...,E 102 - Engineering in the 21st Century (2 units)


In [488]:
# The courses are converted into their subjects (done manually)
df = pd.read_excel("majors.xlsx",usecols = lambda col_name: ("vector" in col_name.lower()))
df

Unnamed: 0,Mechanical-Engineering_vectorized,Chemistry_vectorized,English_vectorized,Biology_vectorized,Physics_vectorized,Computer-Science_vectorized,Education_vectorized,Mathematics_vectorized,History_vectorized,Enviromental-Engineering_Vectorized
0,Chemistry,Chemistry,Humanities,Humanities,Humanities,Chemistry,"Humanities, Social Science","Mathematics, Computer Science",Humanities,Biology
1,Chemistry,Humanities,Humanities,Biology,"Physics, Mathematics","Physics, Engineering","Humanities, Social Science",Mathematics,Humanities,"Environmental Science, Engineering"
2,Engineering,Chemistry,Humanities,"Biology, Environmental Science","Physics, Mathematics",Mathematics,"Humanities, Social Science, Biology",Mathematics,Humanities,"Environmental Science, Engineering"
3,Computer Science,"Mathematics, Chemistry",Humanities,Chemistry,"Physics, Mathematics","Mathematics, Engineering","Social Science, Biology",Mathematics,Humanities,Chemistry
4,Mathematics,Mathematics,Social Science,Chemistry,"Physics, Mathematics",Computer Science,"Social Science, Biology","Mathematics, Computer Science",Humanities,"Environmental Science, Chemistry"
5,"Humanities, Engineering",Mathematics,Social Science,"Biology, Mathematics","Physics, Mathematics","Computer Science, Engineering",Social Science,"Mathematics, Computer Science, Engineering","Humanities, Social Science","Mathematics, Chemistry"
6,Social Science,Mathematics,Humanities,Chemistry,Mathematics,Mathematics,Humanities,Mathematics,Humanities,Social Science
7,"Computer Science, Engineering","Biology, Chemistry",Humanities,Mathematics,Mathematics,"Mathematics, Computer Science",Humanities,Mathematics,"Humanities, Social Science",Mathematics
8,Mathematics,Chemistry,Humanities,Physics,Computer Science,Computer Science,Mathematics,Mathematics,"Humanities, Social Science",Physics
9,"Engineering, Physics",Chemistry,"Humanities, Social Science","Biology, Environmental Science",Social Science,"Computer Science, Mathematics","Humanities, Social Science",Chemistry,Humanities,Engineering


### Here are the functions that drive this engine

In [489]:
def string_to_vec(subject_string):
    '''converts a comma-separated string of course subjects into a multi-hot encoded vector'''
    # parameter renamed from `str` so it no longer shadows the built-in type
    element_list = subject_string.lower().split(",")
    vector = vector_template.copy()
    for element in element_list:
        element_index_in_vector = COURSE_CATEGORIES.index(element.strip())
        vector[element_index_in_vector] = 1
    return vector

In [490]:
# Demo: take a look at how the string_to_vec function works
print("Vector for a Chemistry Course:",string_to_vec("Chemistry"))
print("Vector for a Computer Science Course:",string_to_vec("Computer Science"))
print("Vector for a Chemistry & Computer Science Course (eg: Computational Chemistry):",string_to_vec("Computer Science, Chemistry"))

Vector for a Chemistry Course: [0, 0, 0, 0, 1, 0, 0, 0, 0]
Vector for a Computer Science Course: [0, 0, 0, 0, 0, 0, 0, 0, 1]
Vector for a Chemistry & Computer Science Course (eg: Computational Chemistry): [0, 0, 0, 0, 1, 0, 0, 0, 1]
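
`COURSE_CATEGORIES.index` raises a bare `ValueError` when a cell contains a subject that is not in the category list (for example, a misspelled label in the spreadsheet). One possible variant that names the offending label is sketched below; `string_to_vec_checked` is a hypothetical helper, not part of the original engine.

```python
COURSE_CATEGORIES = ["social science", "humanities", "mathematics", "physics",
                     "chemistry", "environmental science",
                     "biology", "engineering", "computer science"]

def string_to_vec_checked(subject_string):
    """Multi-hot encode a comma-separated subject string, with a clearer error."""
    vector = [0] * len(COURSE_CATEGORIES)
    for label in subject_string.split(","):
        label = label.strip().lower()
        if label not in COURSE_CATEGORIES:
            raise ValueError(
                f"Unknown subject {label!r}; expected one of {COURSE_CATEGORIES}"
            )
        vector[COURSE_CATEGORIES.index(label)] = 1
    return vector

print(string_to_vec_checked("Physics, Mathematics"))  # [0, 0, 1, 1, 0, 0, 0, 0, 0]
```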


In [491]:
# Vectorize the data: apply string_to_vec to each cell in our dataframe
# (note: pandas 2.1+ deprecates applymap in favor of DataFrame.map)
vector_df = df.copy().applymap(string_to_vec)
# strip the "_vectorized" suffix from the column names
vector_df.columns = [column.split("_")[0] for column in vector_df.columns]
vector_df

Unnamed: 0,Mechanical-Engineering,Chemistry,English,Biology,Physics,Computer-Science,Education,Mathematics,History,Enviromental-Engineering
0,"[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 1]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0]"
1,"[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0]","[0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 1, 0, 0, 0, 1, 0]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 0, 1, 0]"
2,"[0, 0, 0, 0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 1, 0, 0]","[0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[1, 1, 0, 0, 0, 0, 1, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 0, 1, 0]"
3,"[0, 0, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 1, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 1, 0]","[1, 0, 0, 0, 0, 0, 1, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]"
4,"[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 1, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 1]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 1, 0, 0, 0]"
5,"[0, 1, 0, 0, 0, 0, 0, 1, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 1, 0, 0]","[0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 1, 1]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 1, 0, 0, 0, 0]"
6,"[1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[1, 0, 0, 0, 0, 0, 0, 0, 0]"
7,"[0, 0, 0, 0, 0, 0, 0, 1, 1]","[0, 0, 0, 0, 1, 0, 1, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 1]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]"
8,"[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 1, 0, 0, 0, 0, 0]"
9,"[0, 0, 0, 1, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 1, 0, 0]","[1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 1]","[1, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0]"


In [492]:
# first, we create some dummy input
input = vector_df["Physics"].tolist() # choose the Physics column as our input
                                      # if our engine works, physics and related subjects should be recommended
input

[[0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1],
 [1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 1, 0, 0, 0, 0]]

In [493]:
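def get_course_distance_to_each_input(college_course, method = 1):
    '''sums the similarity (or distance) between one college course vector and every input course'''
    if method == 1:
        func = cosine_similarity
    elif method == 0:
        func = euclidean_distances
    else:
        raise ValueError("method must be 0 (euclidean distance) or 1 (cosine similarity)")

    total_course_distance = 0
    for hs_course in input:  # `input` is the global list of high school course vectors
        # each call returns a 1x1 array, so the running total is also a 1x1 array
        total_course_distance += func([college_course], [hs_course])
    return total_course_distance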

In [494]:
def get_major_distance_from_input(major):
    '''averages the similarity (or distance) over every (college course, input course) pair for one major'''
    college_course_vectors = vector_df[major].tolist()
    total_distance = 0
    for college_course_vector in college_course_vectors:
        total_distance += get_course_distance_to_each_input(college_course_vector, method = distance_method)
    num_of_comparisons = len(input)*len(college_course_vectors)
    major_distance = total_distance/num_of_comparisons
    return major_distance
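
The nested loops above amount to taking the mean of all pairwise similarities between the input courses and a major's courses. As a rough sketch, the same mean can be computed in one step with matrix operations; `mean_pairwise_cosine` is a name introduced here, it covers only the cosine case, and it assumes no course vector is all zeros.

```python
import numpy as np

def mean_pairwise_cosine(hs_vectors, college_vectors):
    """Mean cosine similarity over every (high school, college) course pair."""
    A = np.asarray(hs_vectors, dtype=float)
    B = np.asarray(college_vectors, dtype=float)
    # Scale each row to unit length so that a dot product equals the cosine
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # (len(A) x len(B)) matrix of pairwise similarities, then its mean
    return float((A_unit @ B_unit.T).mean())

# Toy 4-category example
hs = [[0, 0, 1, 1], [0, 0, 1, 0]]
major = [[0, 0, 1, 0], [0, 1, 0, 0]]
print(mean_pairwise_cosine(hs, major))
```

scikit-learn's `cosine_similarity` also accepts two full matrices and returns the whole pairwise matrix, so passing the two lists of vectors directly and taking the mean of the result would be another way to avoid the explicit loops.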

In [495]:
def get_best_course():
    '''scores every major against the input and returns (score, major) pairs, best match first'''
    majors = vector_df.columns
    major_distances = [get_major_distance_from_input(major) for major in majors]
    major_distances_norm = normalize_scores(major_distances)

    # higher cosine similarity is better, lower euclidean distance is better
    reverse_bool = (distance_method == 1)
    scores = sorted(zip(major_distances_norm, majors), reverse=reverse_bool)
    return scores

In [496]:
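def normalize_scores(scores):
    '''normalize by max value in list and express as a percent'''
    max_score = max(scores) # compute the maximum once instead of once per element
    norm = [(float(i)/max_score)*100 for i in scores] # multiply by 100 to format into a percent
    return norm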
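
Because the scikit-learn pairwise functions return 1x1 arrays here, the normalized scores are themselves 1x1 arrays rather than plain numbers. A small variant, sketched below under the assumption that plain floats are preferred for display, extracts the scalar first; `normalize_to_percent` is a hypothetical name.

```python
import numpy as np

def normalize_to_percent(scores):
    """Normalize by the maximum score and return plain Python floats (percent)."""
    # .item() pulls the single value out of a 1x1 array (or a scalar)
    values = [np.asarray(s, dtype=float).item() for s in scores]
    top = max(values)
    return [100 * v / top for v in values]

raw = [np.array([[0.5]]), np.array([[0.25]]), np.array([[0.125]])]
print(normalize_to_percent(raw))  # [100.0, 50.0, 25.0]
```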

In [497]:
# run the engine
get_best_course()

[(array([[100.]]), 'Physics'),
 (array([[91.93379326]]), 'Mathematics'),
 (array([[61.47878065]]), 'Computer-Science'),
 (array([[49.97177211]]), 'Chemistry'),
 (array([[41.99961255]]), 'Mechanical-Engineering'),
 (array([[37.58081002]]), 'Enviromental-Engineering'),
 (array([[33.20483193]]), 'Biology'),
 (array([[32.14540804]]), 'Education'),
 (array([[23.47936183]]), 'History'),
 (array([[21.89057549]]), 'English')]
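
The engine is described as recommending three majors, while the call above returns the full ranking. A minimal sketch of the final selection step follows, using the scores printed above (rounded to two decimals); `top_majors` and its parameters are illustrative names, not part of the original code.

```python
# (score, major) pairs copied from the output above, rounded to two decimals
ranked = [
    (100.00, "Physics"), (91.93, "Mathematics"), (61.48, "Computer-Science"),
    (49.97, "Chemistry"), (42.00, "Mechanical-Engineering"),
    (37.58, "Enviromental-Engineering"), (33.20, "Biology"),
    (32.15, "Education"), (23.48, "History"), (21.89, "English"),
]

def top_majors(scored_majors, k=3, higher_is_better=True):
    """Return the k best majors; set higher_is_better=False for distance scores."""
    ordered = sorted(scored_majors, reverse=higher_is_better)
    return [major for _, major in ordered[:k]]

print(top_majors(ranked))  # ['Physics', 'Mathematics', 'Computer-Science']
```

As expected for an input copied from the Physics column, Physics ranks first, followed by the closely related Mathematics and Computer-Science majors.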