# Heuristics -> Metrics
1. Course Code
  1. Department code
    1. Clustering based on crosslistings
    2. More generally, if the dept changes AND there is no crosslisting with the old dept, there is a red flag
  2. Course number
    1. 1st int - ~~The first int in course number means a lot if there is a significant change~~ There are many cases where a 100- or 200-level class is crosslisted as a graduate-level class, so the first digit only really means something for a large jump (e.g. 100 -> 400)
    2. 2nd - 4th int - course numbering is somewhat arbitrary, so these digits do sometimes change
2. Title
  1. ~~Changes infrequently? -> direct comparison based on string distance would work~~ -> Want some semantic similarity, because a string distance approach would assume "Prog. Languages I" and "Prog. Languages II" are the same
    1. Could try to take semantic similarity into account with something like [sematch](http://gsi-upm.github.io/sematch/)
3. Prerequisites
  1. Do a recursive call to test the difference between the courses in the pre-requisites
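
The course-code heuristics above can be sketched as a small flagging function. This is only an illustration: the `DEPT-NUMBER` format is inferred from codes like `CIS-120`, and `crosslisted_depts` and the `min_level_jump` threshold are hypothetical parameters, not something the data defines.

```python
import re

def split_full_code(full_code):
    """Split a code such as "CIS-120" into ("CIS", "120")."""
    match = re.fullmatch(r"([A-Z]+)-(\d+)", full_code)
    if match is None:
        raise ValueError(f"unexpected course code format: {full_code!r}")
    return match.group(1), match.group(2)

def course_code_flags(old_code, new_code, crosslisted_depts=frozenset(), min_level_jump=3):
    """Return red flags for a candidate old -> new course pairing.

    crosslisted_depts: departments the new course is cross-listed with
        (hypothetical input, to be derived from the cross-listing data).
    min_level_jump: smallest change in the first digit treated as
        suspicious (assumed threshold; 3 corresponds to 100 -> 400).
    """
    old_dept, old_num = split_full_code(old_code)
    new_dept, new_num = split_full_code(new_code)
    flags = []
    # Department changed and the old department is not a cross-listing.
    if old_dept != new_dept and old_dept not in crosslisted_depts:
        flags.append("dept_changed_without_crosslisting")
    # Only the first digit is compared; later digits are treated as arbitrary.
    if abs(int(old_num[0]) - int(new_num[0])) >= min_level_jump:
        flags.append("large_level_change")
    return flags

print(course_code_flags("CIS-120", "CIS-121"))
print(course_code_flags("CIS-120", "CIS-420"))
```

Returning a list of flags rather than a single boolean keeps the individual signals available for whatever combines them later.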

## Special Cases
* Course diverges into 2 different courses
* "Cross-listings" with different titles
* Cross listings with different codes but the same title
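
One possible way to surface the second special case, assuming that rows sharing a `primary_listing_id` are cross-listings of the same course (an assumption about the data, not something verified here). The toy rows and their codes are made up for illustration.

```python
import pandas as pd

def crosslistings_with_different_titles(df):
    """Return rows whose cross-listing group has more than one distinct title.

    Assumes rows sharing a primary_listing_id are cross-listings of one course.
    """
    n_titles = df.groupby("primary_listing_id")["title"].transform("nunique")
    return df[n_titles > 1]

# Hypothetical toy data: listing 1 is consistent, listing 3 is not.
toy = pd.DataFrame({
    "full_code": ["AAA-101", "BBB-101", "CCC-200", "DDD-200"],
    "title": ["Intro Topic", "Intro Topic", "Methods", "Methods Seminar"],
    "primary_listing_id": [1, 1, 3, 3],
})
flagged = crosslistings_with_different_titles(toy)
print(flagged["full_code"].tolist())
```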

# Questions to answer
1. How often does title change?
2. How often does description change?
3. What is the scale of these kinds of changes?
4. How often do prerequisites change?
5. Are there any correlations between when these changes occur?
6. How often does the most naive approach work?
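
A rough sketch of how question 1 could be measured with the columns loaded below. It assumes that semester strings such as `2019A` sort chronologically as plain strings and that consecutive offerings with the same `full_code` are the same course, which is exactly the naive assumption being tested. The three sample rows are taken from the `CIS-120` output further down.

```python
import pandas as pd

def title_change_rate(df):
    """Fraction of consecutive offerings of the same full_code whose title differs."""
    ordered = df.sort_values(["full_code", "semester"])
    # Title of the previous offering of the same course (NaN for the first one).
    prev_title = ordered.groupby("full_code")["title"].shift()
    has_prev = prev_title.notna()
    if not has_prev.any():
        return float("nan")
    changed = has_prev & (ordered["title"] != prev_title)
    return float(changed.sum() / has_prev.sum())

sample = pd.DataFrame({
    "full_code": ["CIS-120", "CIS-120", "CIS-120"],
    "semester": ["2022A", "2019A", "2021C"],
    "title": [
        "Programming Languages and Techniques I",
        "Prog Lang & Tech I",
        "Programming Languages and Techniques I",
    ],
})
rate = title_change_rate(sample)
print(rate)
```

The same pattern would apply to `description` and `prerequisites` by swapping the column name.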

In [2]:
import numpy as np
import pandas as pd
import csv

In [13]:
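lines = []  # row indices of the entries kept from the raw dump
# Copy only the rows whose first field is "courses" into a separate CSV.
with open("course_data.csv", newline="") as f, \
        open("course_data_courses_only.csv", "w", newline="") as fout:
    data_reader = csv.reader(f)
    data_writer = csv.writer(fout, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    for i, row in enumerate(data_reader):
        # Skip blank rows so row[0] cannot raise an IndexError.
        if row and row[0] == "courses":
            data_writer.writerow(row)
            lines.append(i)
lines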

[327, 328, 329, ..., 491, 492, 493

In [18]:
df = pd.read_csv('course_data_courses_only.csv', verbose=True, names= [
        "type",
        "id",
        "semester",
        "department_id",
        "code",
        "title",
        "description",
        "full_code",
        "prerequisites",
        "primary_listing_id",
])

Tokenization took: 86.53 ms
Type conversion took: 59.01 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 44.01 ms
Type conversion took: 43.01 ms
Parser memory cleanup took: 0.00 ms


In [28]:
grouped_by_full_code = df.groupby(by="full_code")
grouped_by_full_code.get_group("CIS-120")

Unnamed: 0,type,id,semester,department_id,code,title,description,full_code,prerequisites,primary_listing_id
2814,courses,59629,2017A,1,120,Prog Lang & Tech I,,CIS-120,,59629
7930,courses,59356,2014A,1,120,Prog Lang & Tech I,,CIS-120,,59356
11345,courses,116187,2022A,1,120,Programming Languages and Techniques I,A fast-paced introduction to the fundamental c...,CIS-120,,116187
17254,courses,59085,2011A,1,120,Prog Lang & Tech I,,CIS-120,,59085
21512,courses,59115,2010C,1,120,Prog Lang & Tech I,,CIS-120,,59115
24479,courses,59058,2008A,1,120,Prog Lang & Tech I,,CIS-120,,59058
26487,courses,59507,2016A,1,120,Prog Lang & Tech I,,CIS-120,,59507
31306,courses,59700,2018C,1,120,Prog Lang & Tech I,,CIS-120,,59700
35671,courses,76408,2019A,1,120,Prog Lang & Tech I,,CIS-120,,76408
38494,courses,110292,2021C,1,120,Programming Languages and Techniques I,A fast-paced introduction to the fundamental c...,CIS-120,,110292


In [30]:
grouped_by_dept_id = df.groupby(by="department_id")
grouped_by_dept_id.get_group(1)

Unnamed: 0,type,id,semester,department_id,code,title,description,full_code,prerequisites,primary_listing_id
840,courses,58901,2006C,1,500,Software Foundations,,CIS-500,,58901
841,courses,58902,2006C,1,501,Computer Architecture,,CIS-501,,58902
842,courses,58903,2006C,1,520,Intro Artificial Intell,,CIS-520,,58903
843,courses,58904,2006C,1,537,Biomed Image Analysis,,CIS-537,,58904
844,courses,58905,2006C,1,550,Database & Info Systems,,CIS-550,,58905
...,...,...,...,...,...,...,...,...,...,...
119394,courses,59244,2012A,1,565,Gpu Programming&Arch,,CIS-565,,59244
119395,courses,59245,2012A,1,568,Game Design Practicum,,CIS-568,,59245
119396,courses,59246,2012A,1,580,Machine Perception,,CIS-580,,59246
119397,courses,59247,2012A,1,518,Fin Model Th & Desc Comp,,CIS-518,,44468


# Test Dataset
* Sampling of course pairings that are both valid and invalid (couple hundred of each)
* Valid: get year-over-year (yoy) course pairings with high similarity
  * Get some entire course lineages, too
* Invalid 
  * Randomly select courses from same dept
  * Some courses that are completely different
## Groupings to get
1. Same dept/semester different course code (would have to verify courses are actually different)
2. Same dept, diff semester, diff course code
3. Same dept, diff semester, same course code
4. (Special case) same course code, different content -> this should always be manually handled!
5. (Special case) 

## Method
1. Group courses into departments, semesters, years
2. Within each dept, select 2 random course codes at a time, store them in an aggregated set, and then pair them up
3. For each course in the set, write a function that returns its entire history in chronological order as yoy pairs
  1. Also randomly grab rows from the course's history
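
Step 3 might look something like the following. It is a sketch under two assumptions: a course's history is identified by its `full_code`, and semester strings sort chronologically. The helper names `course_history` and `consecutive_pairs` are made up here. Note that it pairs consecutive offerings, which are not always exactly one year apart.

```python
import pandas as pd

def course_history(df, full_code):
    """All offerings of one course code, oldest first."""
    return df[df["full_code"] == full_code].sort_values("semester")

def consecutive_pairs(history):
    """Pair each offering's id with the id of the next offering."""
    ids = history["id"].tolist()
    return list(zip(ids, ids[1:]))

# Rows taken from the CIS-120 and CIS-500 outputs above.
sample = pd.DataFrame({
    "id": [116187, 76408, 110292, 58901],
    "semester": ["2022A", "2019A", "2021C", "2006C"],
    "full_code": ["CIS-120", "CIS-120", "CIS-120", "CIS-500"],
})
pairs = consecutive_pairs(course_history(sample, "CIS-120"))
print(pairs)
```

These pairs would serve as the valid examples; random rows from the same history could be drawn with `history.sample(...)`.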
  

In [None]:
random_state = 101

In [None]:
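# Sample up to 100 rows per department (seeded) and keep the results.
dept_samples = {dept_id: group.sample(min(100, len(group)), random_state=random_state)
                for dept_id, group in grouped_by_dept_id}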

# Decision trees
* Probably the most natural data structure, since the first approximation should always be the course code similarity

## Prerequisite similarity metrics
* If all prereqs change, it is probably a department-wide change
* Best served by going through the prereqs first
  * If two prereqs match under this same algorithm, treat them as the same course --> recursive call
* Take average prerequisite similarity?
  * What if very few prereqs match? -> probably an issue
  * Generally want one of them to be a subset of the other 
  * Maybe include an adjustment factor based on how many prereqs they have (since if there are no prereqs for 1 then similarity might be skewed)
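
One way the ideas above could fit together is an overlap score that equals 1.0 whenever one prerequisite set is a subset of the other, and that returns `None` when either side has no prerequisites, so the missing evidence can be handled separately instead of skewing an average. This is a sketch, not a settled metric: the prerequisite sets are assumed to be already parsed into course codes, `same_course` is a hypothetical hook where the recursive course comparison would go, and the codes in the example are illustrative.

```python
def prereq_similarity(prereqs_a, prereqs_b, same_course=None):
    """Share of the smaller prerequisite set that is matched in the larger one.

    same_course: optional predicate deciding whether two prerequisite codes
        refer to the same course (hypothetical hook for the recursive call);
        defaults to exact equality.
    Returns None if either set is empty, since there is nothing to compare.
    """
    if same_course is None:
        same_course = lambda x, y: x == y
    smaller, larger = sorted((set(prereqs_a), set(prereqs_b)), key=len)
    if not smaller:
        return None
    matched = sum(any(same_course(x, y) for y in larger) for x in smaller)
    return matched / len(smaller)

print(prereq_similarity({"CIS-110"}, {"CIS-110", "CIS-160"}))
print(prereq_similarity({"CIS-110", "MATH-104"}, {"CIS-110", "CIS-160"}))
print(prereq_similarity(set(), {"CIS-110"}))
```

Dividing by the size of the smaller set is what encodes the subset preference; dividing by the union instead would give the Jaccard index, which penalizes added prerequisites.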
 

## Course Title and Description Similarity Metrics
* Use direct character similarity
* Use semantic similarity metrics (e.g. as described [here](https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6), using pytorch with sentence transformers)
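
A quick standard-library sketch of the direct character similarity option, using `difflib.SequenceMatcher` as a stand-in for whichever string distance ends up being used. It mainly illustrates why character similarity alone is risky: two different courses in a sequence score higher than an abbreviated and a full title of the same course.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Different courses, nearly identical strings.
sequel = char_similarity("Prog. Languages I", "Prog. Languages II")
# Same course (titles from the CIS-120 output above), very different strings.
renamed = char_similarity("Prog Lang & Tech I", "Programming Languages and Techniques I")

print(round(sequel, 3), round(renamed, 3))
```

A semantic metric would need to get both of these cases right, which is the motivation for the second bullet.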

In [None]:
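# Very Naive Algorithm
def naive_is_same(course1, course2):
    # Assumption: the original cell breaks off at `if`; comparing full course
    # codes is used here as the simplest possible first approximation.
    return course1["full_code"] == course2["full_code"]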