In [144]:
import pandas as pd
import numpy as np
import json

## Read in scraped data -> DF
Here we are using scraped data for courses provided by [David Tejuosho](https://github.com/DavidTeju). This algorithm serves as the foundation for [CourseWeb](https://github.com/DavidTeju/CourseWeb.git).

In [170]:
df = pd.read_json("../Data/courseTopicSets.json")

In [171]:
df

Unnamed: 0,courseCode,topicSet
0,AMST 110,{}
1,AMST 143,{}
2,AMST 144,{}
3,AMST 170,{}
4,AMST 200,"{'american': 4, 'studies': 3, 'cutting-edge': ..."
...,...,...
537,WII 355,"{'designed': 1, 'wii': 1, 'reflect': 2, 'exami..."
538,WII 357,"{'washington': 1, 'internship': 2, 'program': ..."
539,WII 358,"{'international': 2, 'foreign': 2, 'policy': 1..."
540,WII 359,"{'environmental': 3, 'sustainability': 1, 'int..."


### Get topic string
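Joins every topic word (the keys of a course's `topicSet`) into a single space-separated string. Each `topicSet` has the form
```JSON

{ word : count }
```
Returns the joined string. Single-character keys are skipped.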

In [147]:
def topic_string(course_df):
    data = []
    for i in course_df.topicSet.keys():
        if len(i) > 1:
            data.append(i)

    return ' '.join(data)

### Get topic set
Creates a dictionary of `key: word` & `value: count` for a course. Three additional keys hold
- `d`: document length $|d_i|$, computed here as the number of distinct topic words kept (not the sum of their counts)
- `code`: course code
- `topic-string`: joined string of every topic word in the description

Returns a dictionary containing all of the above.

In [148]:
def get_topic_set(course_df):
    data = {}
    for k,v in course_df.topicSet.items():
        if len(k) > 1:
            data[k] = float(v)

    data['d'] = float(len(data.keys()))
    data['code'] = course_df.courseCode
    data['topic-string'] = topic_string(course_df)
    return data

### Test Dict
Check if desired output is returned

In [149]:
test = get_topic_set(df.iloc[4])

In [150]:
test

{'american': 4.0,
 'studies': 3.0,
 'cutting-edge': 1.0,
 'interdisciplinary': 2.0,
 'field': 1.0,
 'humanities': 1.0,
 'helps': 1.0,
 'answer': 1.0,
 'critical': 1.0,
 'questions': 1.0,
 'society': 1.0,
 'culture': 1.0,
 'approach': 1.0,
 'understanding': 1.0,
 'multicultural': 1.0,
 'world': 1.0,
 'introduce': 1.0,
 'theories': 1.0,
 'methods': 1.0,
 'chicago': 1.0,
 'text': 1.0,
 'takes': 1.0,
 'close': 1.0,
 'city’s': 1.0,
 'people': 1.0,
 'history': 1.0,
 'art': 1.0,
 'architecture': 1.0,
 'literature': 1.0,
 'd': 29.0,
 'code': 'AMST 200',
 'topic-string': 'american studies cutting-edge interdisciplinary field humanities helps answer critical questions society culture approach understanding multicultural world introduce theories methods chicago text takes close city’s people history art architecture literature'}

Check as DF

In [151]:
test_df = pd.DataFrame(index=[test['code']], data=[test])

In [152]:
test_df

Unnamed: 0,american,studies,cutting-edge,interdisciplinary,field,humanities,helps,answer,critical,questions,...,close,city’s,people,history,art,architecture,literature,d,code,topic-string
AMST 200,4.0,3.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,29.0,AMST 200,american studies cutting-edge interdisciplinar...


### Courses-to-DF
Takes every row in the original DataFrame and creates a list of dictionaries, one per course. The list is used to build a new DataFrame with the following structure (column names may differ):

| $\lvert d_i \rvert$ | Code | Topic String | Words ($w_i$)    |
|---------------------|------|--------------|------------------|
| float               | str  | str          | float $\vee$ NaN |

Numeric values are stored as NumPy types, as is standard in pandas.

In [153]:
def courses_to_df(df):
    # Build one topic-set dictionary per row; the caller wraps the list in a DataFrame
    return [get_topic_set(df.iloc[i]) for i in range(len(df))]
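To illustrate the structure described above, here is a minimal sketch using two made-up course dictionaries (the `DEMO` codes and word counts are hypothetical, not taken from the scraped data). It shows how pandas takes the union of all keys as columns and fills words missing from a course with `NaN`.

```python
import pandas as pd

# Two hypothetical dictionaries in the shape returned by get_topic_set
rows = [
    {"d": 2.0, "code": "DEMO 101", "topic-string": "computer science",
     "computer": 2.0, "science": 1.0},
    {"d": 1.0, "code": "DEMO 102", "topic-string": "history",
     "history": 3.0},
]

toy = pd.DataFrame(rows)

# Columns are the union of every key seen across the dictionaries
print(sorted(toy.columns))

# A word that a course does not contain is stored as NaN, not 0.0
print(toy.loc[1, "computer"])
```

This `NaN` fill is why the word columns in the real `data` frame are mostly empty.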

In [154]:
data = pd.DataFrame(courses_to_df(df))

In [155]:
data

Unnamed: 0,d,code,topic-string,american,studies,cutting-edge,interdisciplinary,field,humanities,helps,...,fits,branches,intern,host,embassies,nongovernmental,geared,nation’s,non-governmental,corporation
0,0.0,AMST 110,,,,,,,,,...,,,,,,,,,,
1,0.0,AMST 143,,,,,,,,,...,,,,,,,,,,
2,0.0,AMST 144,,,,,,,,,...,,,,,,,,,,
3,0.0,AMST 170,,,,,,,,,...,,,,,,,,,,
4,29.0,AMST 200,american studies cutting-edge interdisciplinar...,4.0,3.0,1.0,2.0,1.0,1.0,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
537,39.0,WII 355,designed wii reflect examine role individual c...,,,,,,,,...,,,,,,,,,,
538,21.0,WII 357,washington internship program welcomes majors ...,,,,,,,,...,1.0,1.0,,,,,,,,
539,16.0,WII 358,international foreign policy internship progra...,,,,,,,,...,,,1.0,1.0,1.0,1.0,,,,
540,12.0,WII 359,environmental sustainability internship progra...,,1.0,,,,,,...,,,1.0,,,,1.0,,,


Playground cell

In [156]:
len(data.index)

542

## BM25 Scoring Structure

### $L$-Corpus
We sum the lengths $|d_i|$ of all documents in the corpus and divide by the corpus size $N$ to find the average document length.
$$
L = \frac{\sum_i|d_i|}{N}
$$

In [157]:
def l_corpus(corpus):
    """
    Returns average document length in corpus
    :param corpus: Series of document lengths, one entry per document
    :return: sum of all document lengths divided by the number of documents
    """
    return np.divide(corpus.sum(),corpus.size)

In [158]:
avg_dl = l_corpus(data.d)
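As a small worked example of the average, the sketch below uses five hypothetical document lengths (not the real `data.d` column). Note that courses with an empty topic set have length 0 and still count toward $N$, which pulls the average down.

```python
import pandas as pd

# Hypothetical document lengths; two courses have empty descriptions
lengths = pd.Series([0.0, 0.0, 29.0, 39.0, 12.0])

# L = sum of lengths / number of documents
avg = lengths.sum() / lengths.size
print(avg)  # 16.0
```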

### Total Frequency ($TF$)
The function below takes a query term and a document and returns the number of times the term appears in document $d_j$ (its term frequency).

In [159]:
def total_frequency(query, doc):
    """
    Returns total number of times query appears in document
    :param query: keyword being searched
    :param doc: document to check
    :return: word frequency or 0.0
    """
    try:
        return doc[query]
    except KeyError as e:
        return float(0)

In [160]:
total_frequency("not-a-word", data.iloc[4])

0.0
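One caveat: a row taken from the corpus DataFrame carries every word column, so a word that exists elsewhere in the corpus but not in this course is returned as `NaN` rather than raising `KeyError`. The sketch below shows that behaviour on a toy frame and a possible NaN-safe variant; `term_frequency_safe` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def term_frequency_safe(query, doc):
    """Hypothetical NaN-safe variant of total_frequency."""
    # Series.get returns the default when the label is missing entirely
    value = doc.get(query, 0.0)
    # A label that exists but is empty for this document holds NaN
    return 0.0 if pd.isna(value) else float(value)


# Toy corpus: the second document does not contain "computer"
corpus = pd.DataFrame([{"computer": 2.0}, {"history": 1.0}])
doc = corpus.iloc[1]

print(doc["computer"])                         # nan
print(term_frequency_safe("computer", doc))    # 0.0
print(term_frequency_safe("not-a-word", doc))  # 0.0
```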

### Document Frequency ($DF$)
The function below takes in a query and the corpus to compute the total number of documents containing the query.

In [161]:
def document_frequency(query, corpus):
    """
    Returns total number of documents containing query
    :param query: keyword being searched
    :param corpus: document DataFrame
    :return: number of documents containing query or 0.0
    """
    try:
        return corpus[query].count()
    except KeyError as e:
        return float(0)

In [162]:
document_frequency("computer", data)

13
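The count works because `Series.count()` ignores `NaN` entries, so only courses that actually contain the word are counted. A minimal sketch on a hypothetical three-document frame:

```python
import numpy as np
import pandas as pd

# Hypothetical word columns; NaN marks "word not in this document"
toy = pd.DataFrame({
    "computer": [2.0, np.nan, 1.0],
    "history": [np.nan, 3.0, np.nan],
})

# count() returns the number of non-NaN entries in the column
print(toy["computer"].count())  # 2
print(toy["history"].count())   # 1
```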

### Inverse Document Frequency
The function below takes in a query and the corpus to compute the inverse document frequency of the query. The function can be expressed as:
$$
IDF(q_i) = \log\left(\frac{N-DF(q_i)+0.5}{DF(q_i)+0.5}\right)
$$
See [Athens University of Economics and Business](http://ipl.cs.aueb.gr/stougiannis/bm25.html) for more.

In [163]:
def inverse_document_frequency(query, corpus):
    """
    Inverse document frequency of query
    :param query: keyword being searched
    :param corpus: document DataFrame
    :return: np.float64
    """
    N = len(corpus.index)
    DF = document_frequency(query, corpus)
    try:
        eval_nested = np.divide((N - DF + 0.5),DF + 0.5 )
        return np.log(eval_nested)
    except KeyError as e:
        return float(0)

In [164]:
inverse_document_frequency("computer", data)

3.6692434795970774
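The value above can be reproduced by hand from the numbers already shown in this notebook: $N = 542$ documents and $DF = 13$ for `computer`. The sketch below redoes the arithmetic with the standard library only. The second calculation uses a hypothetical count of 400 to show that this form of IDF turns negative once a word appears in more than half of the documents.

```python
import math

N = 542   # corpus size, from len(data.index)
DF = 13   # documents containing "computer"

# IDF = log((N - DF + 0.5) / (DF + 0.5)) = log(529.5 / 13.5)
idf = math.log((N - DF + 0.5) / (DF + 0.5))
print(idf)  # approximately 3.669

# Hypothetical very common word: appears in 400 of the 542 documents
common_idf = math.log((N - 400 + 0.5) / (400 + 0.5))
print(common_idf < 0)  # True
```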

### BM25 Scoring Function
BM25 returns a `score` (a `rank`, in our case) for a document given a query. It takes in a document, a query, and the corpus, and sums the per-term scores over every term in the query to give the `score`. The following expression can be used to interpret the function. <i>Note: this is an implementation of the BM25 scoring function from [Badri Adhikari](https://youtu.be/a3sg6MH8m4k) and [Athens University of Economics and Business](http://ipl.cs.aueb.gr/stougiannis/bm25.html).</i>
$$
BM25(d_j, q) = \sum_{i=1}^{n} IDF(q_i)\times\frac{TF(q_i, d_j)\times(k+1)}{TF(q_i,d_j)+k\times\left(1-b+b\times\frac{|d_j|}{L}\right)}
$$
1. $d_j$ represents a document, or a course in our case, and $|d_j|$ its length (the `d` column)
2. $q_i$ represents the $i$-th query term and $n$ the number of terms in the query. E.g. given `computer science`, $q_1$ refers to `computer` and $q_2$ refers to `science` (reminder: $i$ would begin at 0 in code)
3. $L$ refers to the average document length from our `l_corpus` function
4. $TF$ refers to our `total_frequency` function
5. $IDF$ refers to our `inverse_document_frequency` function
6. $k$ and $b$ are free parameters, set to 2 and 0.75 in the code


In [165]:
def BM25(document, query, corpus):
    # Free parameters
    k = 2
    b = 0.75
    terms = query.split(' ')
    # Length normalisation: 1 - b + b * |d_j| / L (the same for every term)
    norm = 1 - b + b * (document.d / l_corpus(corpus.d))
    summation = 0

    # Sum the per-term scores
    for term in terms:
        IDF = inverse_document_frequency(term, corpus)
        TF = total_frequency(term, document)
        numerator = np.multiply(TF, (k + 1))
        # NOTE: this groups the denominator as (TF + k) * norm,
        # whereas the formula above reads TF + k * norm
        denominator = np.multiply(TF + k, norm)
        summation += np.multiply(IDF, np.divide(numerator, denominator))

    return summation
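The cell above multiplies `TF + k` by the length-normalisation factor, while the formula as written adds `TF` to `k` times that factor. The sketch below compares the two groupings for a single term using plain floats. Both helper names and the input numbers are hypothetical and only illustrate the difference; this is not a replacement for the notebook's function.

```python
import math


def bm25_term_formula(tf, idf, doc_len, avg_len, k=2.0, b=0.75):
    """Per-term score with the denominator TF + k * norm (as in the formula)."""
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k + 1)) / (tf + k * norm)


def bm25_term_notebook(tf, idf, doc_len, avg_len, k=2.0, b=0.75):
    """Per-term score with the denominator (TF + k) * norm (as in the cell)."""
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k + 1)) / ((tf + k) * norm)


# For a document of exactly average length, norm == 1 and both agree
same_a = bm25_term_formula(2.0, 1.0, 10.0, 10.0)
same_b = bm25_term_notebook(2.0, 1.0, 10.0, 10.0)
print(math.isclose(same_a, same_b))  # True

# For a document twice the average length, the two groupings diverge
long_a = bm25_term_formula(2.0, 1.0, 20.0, 10.0)
long_b = bm25_term_notebook(2.0, 1.0, 20.0, 10.0)
print(long_a, long_b)
```

Because the two versions differ for any document whose length is not the corpus average, the scores reported below reflect the grouping used in the cell, not the formula.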

Check that the function works by testing it on `CPSC 450`, a computer science course.

In [166]:
BM25(data.iloc[216], "computer", data)

6.063480720536059

Now we run BM25 for all courses

In [167]:
scores = []
for i in range(len(data)):
    scores.append([BM25(data.iloc[i], "computer science", data), i])

Convert to a dataframe for easy analysis

In [168]:
ranks = pd.DataFrame(scores)
ranks = ranks.sort_values(by=[0])

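Validate output. Column `0` holds the score and column `1` the row position in `data`. The default sort is ascending, so the lowest scores come first and `NaN` scores are listed last.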

In [169]:
ranks

Unnamed: 0,0,1
192,4.400427,192
213,4.400427,213
291,5.142949,291
219,5.322574,219
188,5.677306,188
...,...,...
537,,537
538,,538
539,,539
540,,540
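For reading the result as a ranking, it may help to drop the `NaN` scores, sort in descending order, and attach the course codes. The sketch below shows one way to do that. The scores and `DEMO` codes are hypothetical stand-ins for the notebook's `scores` list and `data.code` column.

```python
import numpy as np
import pandas as pd

# Hypothetical [score, row position] pairs and matching course codes
scores = [[4.4, 0], [np.nan, 1], [6.1, 2], [5.3, 3]]
codes = pd.Series(["DEMO 101", "DEMO 102", "DEMO 103", "DEMO 104"])

ranks = pd.DataFrame(scores, columns=["score", "row"])

# Look up each row position in the codes Series
ranks["code"] = ranks["row"].map(codes)

# Remove unscored courses and put the best match first
top = ranks.dropna(subset=["score"]).sort_values(by="score", ascending=False)
print(top[["code", "score"]])
```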
