#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 2:  PageRank + Learning to Rank

### 100 points [10% of your final grade]

### Due: March 5, 2020 by 11:59pm

*Goals of this homework:* In this homework you will explore real-world challenges of building a graph (in this case, from tweets), then implement and test the classic PageRank algorithm over this graph. In addition, you will apply learning to rank to a real-world dataset and report the performance in terms of NDCG.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw2.ipynb`. For example, my homework submission would be something like `555001234_hw2.ipynb`. Submit this notebook via eCampus (look for the homework 2 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: PageRank (60 points)
In this assignment, we're going to adapt the classic PageRank approach to allow us to find not the most authoritative web pages, but rather to find significant Twitter users. 


## Part 1.1: A re-Tweet Graph (20 points)

So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their retweets of other Twitter users (so user = node, retweet of another user = edge). Over this Twitter-user graph, we can apply the PageRank approach to order the users. The main idea is that a user who is retweeted by other users is more "impactful". 

Here is a toy example. Suppose you are given the following four retweets:

* **userID**: diane, **text**: "RT ", **sourceID**: bob
* **userID**: charlie, **text**: "RT Welcome", **sourceID**: alice
* **userID**: bob, **text**: "RT Hi ", **sourceID**: diane
* **userID**: alice, **text**: "RT Howdy!", **sourceID**: parisa

Four short tweets are retweeted by four users. The retweets between users form a directed graph with five nodes and four edges. For example, the "diane" node has a directed edge to the "bob" node.
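
As a quick check of the toy example, the four retweets can be loaded into an adjacency structure and counted. This is only a sketch: the `toy_retweets` list and the `build_toy_graph` helper are made up for illustration and are not part of the provided data or starter code.

```python
from collections import defaultdict

# The four toy retweets from the example, as (retweeting user, original author) pairs.
toy_retweets = [
    ("diane", "bob"),
    ("charlie", "alice"),
    ("bob", "diane"),
    ("alice", "parisa"),
]


def build_toy_graph(retweets):
    """Return {user: set of users they retweeted}, with every user present as a key."""
    graph = defaultdict(set)
    for user, source in retweets:
        graph[user].add(source)
        # Register the retweeted user as a node even if they never retweet anyone.
        graph.setdefault(source, set())
    return graph


toy_graph = build_toy_graph(toy_retweets)
num_nodes = len(toy_graph)
num_edges = sum(len(targets) for targets in toy_graph.values())
print(num_nodes, num_edges)  # 5 4
```

Note that "parisa" only appears as a retweeted author, so she must be added as a node explicitly or the count would come out as four.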

You should build a graph by parsing the tweets in the file we provide called *PageRank.json*.

**Notes:**

* You may see some weird characters in the content of tweets; just ignore them. 
* The edges are binary and directed. If Bob retweets Alice once, in 10 tweets, or 10 times in one tweet, there is an edge from Bob to Alice, but there is not an edge from Alice to Bob.
* If a user retweets herself, ignore it.
* Correctly parsing screen_name in a tweet is error-prone. Use the id of the user (this is the user who is re-tweeting) and the id of the user in the retweeted_status field (this is the user who is being re-tweeted; that is, this user created the original tweet).
* Later you will need to implement the PageRank algorithm on the graph you build here.
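
The notes above can be sketched as a small edge extractor. The sample records below are made up and contain only the fields named in the notes (`user.id` and `retweeted_status.user.id`); the `extract_edges` helper and the handling of records without a `retweeted_status` field are assumptions of this sketch, not requirements of the assignment.

```python
import json

# Made-up tweet records, one JSON object per line.
sample_lines = [
    '{"user": {"id": 1}, "text": "RT Hi", "retweeted_status": {"user": {"id": 2}}}',
    '{"user": {"id": 1}, "text": "RT Hi again", "retweeted_status": {"user": {"id": 2}}}',
    '{"user": {"id": 3}, "text": "RT me", "retweeted_status": {"user": {"id": 3}}}',
    '{"user": {"id": 2}, "text": "not a retweet"}',
]


def extract_edges(lines):
    """Return the set of (retweeter_id, author_id) pairs found in the lines."""
    edges = set()
    for line in lines:
        tweet = json.loads(line)
        original = tweet.get("retweeted_status")
        if original is None:
            continue  # assumed: records that are not retweets contribute no edge
        retweeter = tweet["user"]["id"]
        author = original["user"]["id"]
        if retweeter == author:
            continue  # self-retweets are ignored
        edges.add((retweeter, author))  # a set keeps each edge binary
    return edges


sample_edges = extract_edges(sample_lines)
print(sample_edges)  # {(1, 2)}
```

Storing edges in a set means that retweeting the same user twice still yields a single directed edge, which matches the binary-edge rule.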


In [1]:
# Here define your function for building the graph by parsing 
# the input file of tweets
# Insert as many cells as you want

In [2]:
# Call your function to print out the size of the graph, 
# i.e., the number of nodes and edges
# How you maintain the graph is totally up to you
# However, if you encounter any memory issues, we recommend you 
# write the graph into a file, and load it later.

In [9]:
import json
from collections import defaultdict

edges = 0
graph = defaultdict(dict)


def addEdge(userID, sourceID):
    # Edges are binary: repeated retweets of the same source still count once.
    graph[userID][sourceID] = 1
    # Make sure the retweeted user is also a node, even with no outgoing edges.
    if sourceID not in graph:
        graph[sourceID] = {}
        

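# The assignment text calls the data file PageRank.json; this run reads it from HITS.json.
with open('HITS.json', encoding='UTF-8') as data_lines:
    for line in data_lines:
        data = json.loads(line)
        userID = data['user']['id']  # the user who is retweeting
        sourceID = data['retweeted_status']['user']['id']  # the original author
        if userID == sourceID:
            continue  # ignore self-retweets
        addEdge(userID, sourceID)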
    
    
def key_id(keys_sorted):
    keys_id_dict = {}
    j = 0
    for i in keys_sorted:
        keys_id_dict[i] = j
        j += 1
    return keys_id_dict


# graph -> matrix
def get_matrix(graph):
    keys_sorted = sorted(graph.keys())
    size = len(keys_sorted)
    keys_id_dict = key_id(keys_sorted)
    M = [[0] * size for _ in range(size)]
    # Only visit edges that exist instead of probing every pair of users.
    for k1 in keys_sorted:
        for k2, weight in graph[k1].items():
            if k1 != k2:
                M[keys_id_dict[k1]][keys_id_dict[k2]] = weight

    return M, keys_id_dict


M, keys_id_dict = get_matrix(graph)


import numpy as np
b = np.array(M, dtype=float)
print("The number of nodes: ", b.shape[0])


# Every stored edge has weight 1, so counting non-zero entries counts edges.
count = int(np.count_nonzero(b))
print("The number of edges: ", count)

The number of nodes:  1003
The number of edges:  6177


We will not check the correctness of your graph. However, this will affect the PageRank results later.

## Part 1.2: PageRank Implementation (30 points)

Your program will return the top 10 users with highest PageRank scores. The **output** should be like:

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal probability.
* The probability of the random surfer teleporting is 0.1 (that is, the damping factor is 0.9).
* If a user is never retweeted and does not retweet anyone, their PageRank score should be zero. Do not include the user in the calculation.
* It is up to you to decide when to terminate the PageRank calculation.
* There are PageRank implementations out there on the web. Remember, your code should be **your own**.
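
One common way to write the update these rules describe is shown here only as a sketch. $N$ is the number of users kept in the calculation, $d = 0.9$ is the damping factor, and $\mathrm{out}(j)$ is the number of distinct users that $j$ retweeted. These symbols are introduced for illustration; the assignment does not prescribe a notation, nor how to treat users with no outgoing edges.

$$
r_i^{(0)} = \frac{1}{N}, \qquad
r_i^{(t+1)} = \frac{1 - d}{N} + d \sum_{j \,:\, j \to i} \frac{r_j^{(t)}}{\mathrm{out}(j)}
$$

Here $j \to i$ means user $j$ retweeted user $i$. Iteration stops once the scores change by less than a chosen tolerance or after a fixed number of steps.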


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment includes their latest versions.
* Test your parsing and PageRank calculations using a handful of tweets, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.
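
Following the hint about testing on a handful of tweets, here is a minimal power-iteration sketch run on the toy graph from Part 1.1. It is not a reference solution: the `toy_pagerank` name, the tolerance, the iteration cap, and the choice to spread the score of users with no outgoing edges evenly over all users are assumptions made for this example.

```python
def toy_pagerank(out_links, damping=0.9, tol=1e-10, max_iter=1000):
    """Power iteration over {node: set of nodes it links to}."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # equal starting probability
    for _ in range(max_iter):
        # Score held by nodes with no out-links is spread evenly (an assumption).
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        # Stop when the total absolute change falls below the tolerance.
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# An edge points from the retweeter to the original author.
toy_links = {
    "diane": {"bob"},
    "charlie": {"alice"},
    "bob": {"diane"},
    "alice": {"parisa"},
    "parisa": set(),
}
toy_scores = toy_pagerank(toy_links)
for user, score in sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(user, round(score, 4))
```

On this graph "charlie" is never retweeted, so he ends up with the lowest score, while "bob" and "diane" retweet only each other and receive equal scores. Checks like these are easy to verify by hand before moving to the full file.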

What is the termination condition in your PageRank implementation? Describe it below:

*The implementation below stops after a fixed 100 power-iteration steps; it does not test for convergence.*

In [3]:
# Here add your code to implement a function called PageRanker
# Insert as many cells as you want

# def PageRanker(...):
#    ...


In [10]:
a = np.transpose(b)


def graphMove(a):
    # a is the transposed adjacency matrix, so column j holds user j's outgoing edges.
    # Divide each column by that user's out-degree; users with no outgoing
    # edges keep an all-zero column.
    out_degree = a.sum(axis=0)
    c = np.zeros(a.shape, dtype=float)
    np.divide(a, out_degree, out=c, where=out_degree != 0)
    return c


def firstPr():
    # Every user starts with the same probability 1/N.
    n = b.shape[0]
    return np.full((n, 1), 1.0 / n)


# Uniform teleportation matrix: every entry is 1/N.
mm = np.full(b.shape, 1.0 / b.shape[0])
        
def PageRanker(p,m,pr):
    T = p * m + (1 - p) * mm
    pr = np.transpose(pr)
    for i in range(100):
        pr = np.dot(pr, T)
        pr /= np.sum(pr)
    return pr

M = graphMove(a)
M = np.transpose(M)
pr = firstPr()
p = 0.9

In [4]:
# Now let's call your function on the graph you've built. Output the results.

In [11]:
PR = PageRanker(p,M,pr)


# Map matrix indices back to user ids, then print the ten highest-scoring users.
id_keys_dict = {idx: user for user, idx in keys_id_dict.items()}
values = sorted(enumerate(PR[0]), key=lambda d: d[1], reverse=True)

for idx, score in values[:10]:
    print(id_keys_dict[idx], '\t', score)

1183906148 	 0.03690928484050658
2598548166 	 0.023700823719667377
3019659587 	 0.022420060162606687
3077695572 	 0.022046237452155906
3154266823 	 0.01990739513690696
3042570996 	 0.01977500924476432
3068694151 	 0.01954808142465658
3264645911 	 0.017964127148931492
3082766914 	 0.017457986143068278
571198546 	 0.016351985674481474


## Part 1.3: Improving PageRank (10 points)
In the many years since PageRank was introduced, there have been many improvements and extensions. For this part, you should experiment with one such improvement and then compare the results you get with the original results in Part 1.2. 

In [5]:
# Here add your code

In [12]:
graph = defaultdict(dict)


def addEdge(userID, sourceID):
    if sourceID in graph[userID]:
        graph[userID][sourceID] += 1
    else:
        graph[userID][sourceID] = 1
    if sourceID not in graph:
        graph[sourceID][userID] = 0
        

data_lines = open('HITS.json', encoding='UTF-8')
for line in data_lines:
    data = json.loads(line)
    userID = data['user']['id']
    sourceID = data['retweeted_status']['user']['id']
    if userID == sourceID:
        continue
    addEdge(userID, sourceID)
data_lines.close()
    
    
M, keys_id_dict = get_matrix(graph)


b = np.array(M,dtype = float)
a = np.transpose(b)
M = graphMove(a)
M = np.transpose(M)
pr = firstPr()
p = 0.9


PR = PageRanker(p,M,pr)


# Map matrix indices back to user ids, then print the ten highest-scoring users.
id_keys_dict = {idx: user for user, idx in keys_id_dict.items()}
values = sorted(enumerate(PR[0]), key=lambda d: d[1], reverse=True)

for idx, score in values[:10]:
    print(id_keys_dict[idx], '\t', score)

3042570996 	 0.058613667318172254
2860872854 	 0.04766331262265769
1183906148 	 0.0338283838786767
3142161801 	 0.026258939442898775
610166901 	 0.025730931339386323
3154266823 	 0.02567377376546478
2598548166 	 0.023083933063185732
3198584744 	 0.021613601439217874
3156878078 	 0.020016610074063287
3169039209 	 0.019856262064168316


In [6]:
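# Extension: weighted edges.
# The edges are no longer binary. Each matrix entry now holds the number of
# times userID retweeted sourceID, so a user's score is split among the users
# they retweeted in proportion to those counts.
#
# Why: if A retweets B many times, A endorses B more strongly than a user A
# retweeted only once, so B should receive a larger share of A's score.
#
# Comparison to Part 1.2: the top user changes from 1183906148 to 3042570996
# (ranked 6th before), and only four users appear in both top-10 lists
# (3042570996, 1183906148, 3154266823, 2598548166). Users who are retweeted
# repeatedly by the same accounts move up, which arguably reflects their
# influence better than the binary graph does.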
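
The weighted-edge idea can be illustrated on a tiny example. The `counts` dictionary and the `transition_probabilities` helper below are hypothetical and only show how retweet counts could be turned into outgoing probabilities; they are not the code used to produce the results above.

```python
# Hypothetical retweet counts: counts[u][v] = how many times u retweeted v.
counts = {
    "a": {"b": 3, "c": 1},
    "b": {"c": 2},
    "c": {},
}


def transition_probabilities(counts):
    """Turn raw retweet counts into per-user outgoing probabilities."""
    probs = {}
    for user, targets in counts.items():
        total = sum(targets.values())
        # Users who retweeted nobody keep an empty row.
        probs[user] = {v: w / total for v, w in targets.items()} if total else {}
    return probs


weighted = transition_probabilities(counts)
print(weighted["a"])  # {'b': 0.75, 'c': 0.25}
```

With binary edges, "a" would pass half of its score to each of "b" and "c"; with counts, "b" receives three quarters.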

# Part 2: Learning to Rank (40 points)

For this part, we're going to play with some Microsoft LETOR data that has query-document relevance judgments. Let's see how learning to rank works in practice. 

First, you will need to download the MQ2008.zip file from the Resources tab on Piazza. This is data from the [Microsoft Research IR Group](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/).

The data includes 15,211 rows. Each row is a query-document pair. The first column is the relevance label of the pair (0, 1, or 2; the higher the value, the more relevant the document), the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. A query-document pair is represented by a 46-dimensional feature vector. Each feature is a numeric value describing the document and query, such as TF-IDF, BM25, or PageRank. You can find a complete description of the features [here](https://arxiv.org/ftp/arxiv/papers/1306/1306.2597.pdf).
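
To make the row layout concrete, here is a small parsing sketch. The `sample_row` string is made up and shortened to three features (real rows carry 46), the document id in it is a placeholder, and the assumption that the trailing comment looks like `#docid = <id>` should be checked against the actual files.

```python
# A shortened, made-up row in the layout described above.
sample_row = "2 qid:10032 1:0.056537 2:0.000000 3:0.666667 #docid = GX000-00-0000000"


def parse_letor_row(row):
    """Split one row into (label, qid, features, docid)."""
    body, _, comment = row.partition("#")
    tokens = body.split()
    label = int(tokens[0])                      # relevance label: 0, 1, or 2
    qid = tokens[1].split(":", 1)[1]            # "qid:10032" -> "10032"
    features = [float(tok.split(":", 1)[1]) for tok in tokens[2:]]
    # Assumed comment layout: "docid = <id> ..."; take the token after "=".
    comment_tokens = comment.split()
    docid = comment_tokens[2] if len(comment_tokens) > 2 else None
    return label, qid, features, docid


label, qid, features, docid = parse_letor_row(sample_row)
print(label, qid, features, docid)
```

Splitting on `#` first keeps the comment fields from being mistaken for features.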

The good news for you is the dataset is ready for analysis: It has already been split into 5 folds (see the five folders called Fold1, ..., Fold5).


## Part 2.1: Build Point-wise Learning to Rank  (20 points)
First, you should build a point-wise Learning to Rank framework. 
1. You could train a binary classification model like SVM or logistic regression on the train file. In this case, 0 is treated as a negative (irrelevant) sample, and 1 and 2 are treated as positive (relevant) samples.
2. Apply the trained model to predict scores for the documents in the test file.
3. Order the documents based on the scores.
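
The three steps above can be sketched end to end on toy data. Everything here is illustrative: the rows are made up, each document has a single feature instead of 46, and the model is a hand-written one-feature logistic regression standing in for whichever classifier you choose.

```python
import math
from collections import defaultdict

# Made-up rows: (relevance label, qid, docid, single feature value).
train_rows = [(0, "q1", "d1", 0.1), (0, "q1", "d2", 0.2),
              (1, "q2", "d3", 0.8), (2, "q2", "d4", 0.9)]
test_rows = [(0, "q3", "d5", 0.3), (2, "q3", "d6", 0.7), (1, "q3", "d7", 0.5)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, lr=0.5, epochs=500):
    """Step 1: fit a one-feature logistic regression on binarized labels."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for label, _, _, x in rows:
            y = 1.0 if label > 0 else 0.0  # labels 1 and 2 both count as relevant
            err = sigmoid(weight * x + bias) - y
            grad_w += err * x
            grad_b += err
        weight -= lr * grad_w / len(rows)
        bias -= lr * grad_b / len(rows)
    return weight, bias


def rank_by_query(rows, weight, bias):
    """Steps 2 and 3: score each test document, then sort within each query."""
    per_query = defaultdict(list)
    for _, qid, docid, x in rows:
        per_query[qid].append((docid, sigmoid(weight * x + bias)))
    return {qid: sorted(docs, key=lambda d: d[1], reverse=True)
            for qid, docs in per_query.items()}


weight, bias = fit_logistic(train_rows)
ranking = rank_by_query(test_rows, weight, bias)
print([docid for docid, _ in ranking["q3"]])
```

The point-wise approach scores each document independently; the per-query grouping happens only at ranking time.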

add your results and discussion here

In [13]:
def createTrainDataSet(path):
    f = open(path, encoding='UTF-8')
    line = f.readline()
    x_train = []
    y_train = []
    while line:
        each_line_list = []
        j = 2
        line_list = line.split()
        for i in range(46):
            each_line_list.append(float(line_list[j].split(":")[1]))
            j = j + 1
        x_train.append(each_line_list)
        y_train.append(int(line_list[0]))
        line = f.readline()
    f.close()
    x_train = np.array(x_train)
    y_train = np.array(y_train)
    for i in range(len(y_train)):
        if y_train[i] != 0:
            y_train[i] = 1
    return x_train, y_train
            
            
def createValiDataSet(path):
    # The validation file has the same layout as the training file,
    # so reuse the training parser (labels 1 and 2 are both mapped to 1).
    return createTrainDataSet(path)
    

def createTestDataSet(path):
    f = open(path, encoding='UTF-8')
    line = f.readline()
    x_test = []
    y_test = []
    qid_test = []
    docid_test = []
    rel_test = []
    while line:
        each_line_list = []
        j = 2
        line_list = line.split()
        for i in range(46):
            each_line_list.append(float(line_list[j].split(":")[1]))
            j = j + 1
        x_test.append(each_line_list)
        y_test.append(int(line_list[0]))
        rel_test.append(int(line_list[0]))
        qid_test.append(int(line_list[1].split(":")[1]))
        docid_test.append(line_list[50].split()[0])
        line = f.readline()
    f.close()
    x_test = np.array(x_test)
    y_test = np.array(y_test)
    rel_test = np.array(rel_test)
    # qid_test = np.array(qid_test)
    for i in range(len(y_test)):
        if y_test[i] != 0:
            y_test[i] = 1
    return x_test, y_test, rel_test, qid_test, docid_test

In [14]:
def sorted_docid_print():
    # prepare for print
    c = clf.predict_log_proba (x_test)
    
    score_list = []
    for i in range(len(c)):
        score_list.append(c[i][1])
    
    qid_docid_score_dict = {}
    
    for i in range(len(c)):
        qid_docid_list = [docid_test[i], score_list[i]]
        if qid_test[i] not in qid_docid_score_dict:
            qid_docid_score_dict[qid_test[i]] = [qid_docid_list]
        else:
            qid_docid_score_dict[qid_test[i]].append(qid_docid_list)

    # print
    for key, value in qid_docid_score_dict.items():
        value.sort(key=lambda x: x[1], reverse=True)
        print("qid: ", key)
        for i in range(len(value)):
            print(value[i][0], '\t', value[i][1])

In [24]:
from sklearn.svm import SVC
from sklearn import metrics

x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold1\\train.txt")
x_vali, y_vali = createValiDataSet("MQ2008\\MQ2008\\Fold1\\vali.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold1\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
predictions = clf.predict(x_vali)
print("Accuracy on {} vali file is {}".format("Fold1", metrics.accuracy_score(y_vali, predictions)))
# apply the model on test file and print
print("For each qid in Fold1 test file, sort the documents:")
sorted_docid_print()

Accuracy on Fold1 vali file is 0.7905430365718508
For each qid in Fold1 test file, sort the documents:
qid:  18219
GX004-93-7097963 	 -0.09800205807517215
GX016-32-14546147 	 -0.6067423670528554
GX020-25-8391882 	 -1.0976107432265527
GX025-94-0531672 	 -1.6022483681442747
GX026-03-13004845 	 -1.6835744134850332
GX268-53-13016636 	 -1.7592257163064422
GX048-02-13747475 	 -1.8523544429328622
GX010-40-4497720 	 -2.056991464285635
qid:  18230
GX230-44-5924225 	 -0.9467874187560672
GX200-73-5240168 	 -1.0441993357875445
GX061-04-16698930 	 -1.0448255537334858
GX111-60-4484292 	 -1.1362475177005156
GX234-25-8280949 	 -1.352160572128342
GX231-48-9604104 	 -1.4078888082317946
GX056-82-13312051 	 -1.445050435412782
GX001-00-5105044 	 -1.4769906532039212
GX234-41-12695861 	 -1.4792511659375747
GX027-23-15133882 	 -1.4933607452656958
GX265-66-0282836 	 -1.5004216271327748
GX005-40-1622912 	 -1.5062844758790568
GX236-92-10964728 	 -1.5301212869274285
GX004-53-15843752 	 -1.5928020917669856
GX084-2

GX246-59-15840672 	 -1.7577444313047295
GX246-38-7637231 	 -1.7606022098831822
GX233-86-13327098 	 -1.7895959871949165
GX058-57-6092635 	 -1.791967733573414
GX097-58-6801856 	 -1.8111951929520562
GX145-32-2987002 	 -3.2285969582194753
qid:  18985
GX046-79-6984659 	 -0.43649413946320975
GX243-98-10056047 	 -0.9844139681486702
GX117-97-1995412 	 -1.752604090721405
GX106-18-10537656 	 -1.7547108530244602
GX028-54-14412928 	 -1.754836354278585
GX050-44-7083356 	 -1.7680608240732318
GX249-52-14457682 	 -1.7725643439265442
GX023-44-5515463 	 -1.9613603100488155
qid:  18995
GX010-07-9607684 	 -0.5591803453371031
GX002-23-15025545 	 -0.5748628201948326
GX000-17-7249657 	 -0.924171384240896
GX000-40-3033191 	 -0.9295640466572682
GX232-75-8744584 	 -1.0039637363283076
GX041-93-5902198 	 -1.0043307404873378
GX098-32-13542691 	 -1.1029323233323154
GX230-48-11198537 	 -1.1561342241067023
GX007-76-15909681 	 -1.2017306323247294
GX054-34-13142032 	 -1.3136934819797903
GX023-39-9549110 	 -1.3254052638

GX003-46-12533972 	 -0.37949178482902485
GX000-83-15528972 	 -0.3813780139755808
GX013-70-4280225 	 -0.9595861957770782
GX015-73-13249830 	 -1.1780035603863233
GX240-90-0160597 	 -1.3673549377586218
GX000-32-11495797 	 -1.7664569124296579
GX219-81-1425766 	 -1.857457329222392
GX219-83-14052647 	 -1.8944029470952681
GX014-84-9404435 	 -1.8964158417594374
GX047-57-6089538 	 -1.924448637672045
GX038-93-5335884 	 -1.9455885829768083
GX012-88-9820687 	 -1.9745233047726551
GX004-44-12077050 	 -1.9990579415291951
GX030-85-1618831 	 -3.0853593262353867
GX054-49-7637970 	 -3.2197971130085983
qid:  19812
GX011-15-9420865 	 -0.5383556455954485
GX013-41-6443449 	 -0.6217617368621396
GX000-36-13488892 	 -1.5371342874415441
GX030-43-14989934 	 -1.7420188301239636
GX042-63-7295622 	 -1.75759132022499
GX044-95-11375476 	 -1.9351253752367419
GX243-44-3433546 	 -2.0868241071498215
GX261-73-9436793 	 -2.3388846249393254
qid:  19836
GX012-91-16019834 	 -1.74732972484726
GX001-13-6132259 	 -1.7480426328623

In [25]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold2\\train.txt")
x_vali, y_vali = createValiDataSet("MQ2008\\MQ2008\\Fold2\\vali.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold2\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
predictions = clf.predict(x_vali)
print("Accuracy on {} vali file is {}".format("Fold2", metrics.accuracy_score(y_vali, predictions)))
# apply the model on test file and print
print("For each qid in Fold2 test file, sort the documents:")
sorted_docid_print()

Accuracy on Fold2 vali file is 0.8068893528183716
For each qid in Fold2 test file, sort the documents:
qid:  10002
GX246-16-5503229 	 -0.7853411325559542
GX255-50-7550514 	 -1.5465360309921796
GX229-14-12863205 	 -1.6272513129032626
GX008-86-4444840 	 -1.7269852928544582
GX240-35-2775348 	 -1.7731203845898966
GX044-30-4142998 	 -1.7864664784793116
GX037-06-11625428 	 -1.9164959895031797
GX228-42-3888699 	 -2.1443181040025565
qid:  10032
GX256-43-0740276 	 -1.752258046950845
GX030-77-6315042 	 -1.7653941155596156
GX024-71-0000000 	 -1.773175595228893
GX266-75-11189217 	 -1.7769616491413445
GX140-98-13566007 	 -1.8373591025983647
GX029-35-5894638 	 -1.9359852563484385
GX010-65-7921994 	 -2.248863182468733
GX029-17-16711721 	 -2.3171264340889284
qid:  10035
GX046-28-2590531 	 -0.3444432979397355
GX026-92-0492427 	 -1.158657589994304
GX271-73-0262448 	 -1.7677229568590007
GX031-29-0590777 	 -1.7689882800438728
GX187-61-14052950 	 -1.7723327152748398
GX259-93-1304063 	 -1.794496610617967
GX

GX237-17-8524852 	 -1.768333638623558
GX229-11-4202570 	 -1.7691716653164546
GX243-49-4554177 	 -1.7698373331856638
GX230-52-3932503 	 -1.7709208139567527
GX242-55-4513334 	 -1.7709682537995135
GX259-98-16517236 	 -1.7716790755042573
GX230-80-3081593 	 -1.7717720915028907
GX230-90-16586147 	 -1.7724524387589393
GX251-26-4148937 	 -1.77297400200621
GX031-59-12961673 	 -1.7731977074726153
GX233-80-3104814 	 -1.7734380470497622
GX242-93-5948815 	 -1.7737904234450739
GX001-73-7105934 	 -1.7742188631291897
GX031-64-13264345 	 -1.7746078994475771
GX057-06-11144456 	 -1.7746506068138534
GX037-30-2299986 	 -1.7746690259760813
GX231-20-16400216 	 -1.774947145354073
GX016-18-10762696 	 -1.7749871851073546
GX173-89-9694948 	 -1.7751996659977365
GX079-07-10300458 	 -1.7757018225660464
GX079-08-12615511 	 -1.7758683187028343
GX071-16-7492430 	 -1.77588313306115
GX253-92-13784982 	 -1.7758959537232
GX023-34-14617302 	 -1.775901172214935
GX215-64-7506598 	 -1.7759782308720464
GX025-36-3560893 	 -1.77

GX240-59-0968394 	 -1.7868716730581398
GX272-04-9257062 	 -1.8503154846246985
GX091-94-0734922 	 -1.863272063183781
GX239-22-0993943 	 -1.8639965154084732
GX229-13-6336927 	 -1.868383053585039
GX051-58-0222071 	 -1.8827745301192254
GX027-61-4727029 	 -1.8985028188395399
GX264-19-5725808 	 -1.9132092049748441
GX248-62-5743928 	 -1.9677326231426768
GX033-82-4346038 	 -1.98388655002683
GX216-02-0348057 	 -2.0505311416665752
GX015-32-6474638 	 -2.0591820239894307
GX003-69-14131940 	 -2.091365748569426
GX226-84-8279862 	 -2.182926323764512
GX015-19-3999484 	 -2.2244481783616785
GX124-22-7506760 	 -2.284019579353147
GX053-18-8951599 	 -2.5581119697831625
GX015-19-12783938 	 -2.594219492931057
qid:  11699
GX012-65-5757444 	 -1.7454497676274907
GX005-60-3878348 	 -1.773192937563804
GX252-08-7234050 	 -1.792523890028075
GX069-26-3383465 	 -1.7943578812500742
GX003-75-4044382 	 -1.8005323273494935
GX083-49-0363724 	 -1.8121587461384754
GX034-33-16360772 	 -2.383726503117354
qid:  11725
GX026-25-

In [26]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold3\\train.txt")
x_vali, y_vali = createValiDataSet("MQ2008\\MQ2008\\Fold3\\vali.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold3\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
predictions = clf.predict(x_vali)
print("Accuracy on {} vali file is {}".format("Fold3", metrics.accuracy_score(y_vali, predictions)))
# apply the model on test file and print
print("For each qid in Fold3 test file, sort the documents:")
sorted_docid_print()

Accuracy on Fold3 vali file is 0.7896351858165701
For each qid in Fold3 test file, sort the documents:
qid:  11909
GX062-53-0946803 	 -0.2968733068475248
GX239-51-1457804 	 -1.268121182829309
GX227-72-3223024 	 -1.6930871706655084
GX043-50-8139281 	 -1.7210277494736346
GX000-01-8011551 	 -1.767787942866075
GX262-67-1741687 	 -1.7716110144994393
GX012-13-11604073 	 -2.673063440962242
GX036-18-10002856 	 -2.8224267959228087
qid:  11935
GX005-19-11283495 	 -0.12896396464394239
GX003-85-8789820 	 -1.7520470836068807
GX119-64-7712247 	 -1.7536351118873374
GX272-62-5268782 	 -1.7586822992357005
GX012-98-0454654 	 -1.7605647708909615
GX068-28-14654603 	 -1.7654607703353768
GX234-79-10440929 	 -1.7880537980946372
GX233-51-6971727 	 -3.006257519949145
qid:  11936
GX033-17-3277233 	 -1.7482173735039834
GX256-12-16249508 	 -1.7501390921891622
GX041-68-3445739 	 -1.7528956048708713
GX182-56-1410214 	 -1.7536310291826036
GX244-39-4607145 	 -1.7536733565131757
GX181-24-14116117 	 -1.753937916102811


GX114-35-6014260 	 -1.755411691078379
GX050-72-15810231 	 -1.757276501579235
GX241-85-1971116 	 -1.761548795933988
GX262-12-2656127 	 -1.7625391156221555
GX049-56-3151019 	 -1.7628783894950402
GX153-45-5966351 	 -1.7654701906889936
GX034-02-6703550 	 -1.7665772731902494
GX218-05-3263352 	 -1.7673616844954856
GX011-99-3040004 	 -1.7693917253552838
GX029-72-4008543 	 -1.770214885908237
GX062-10-14611847 	 -1.7718471691178177
GX107-09-7353836 	 -1.7721417928366567
GX104-20-6303486 	 -1.7727073481394289
GX070-00-9388561 	 -1.772733435233913
GX138-58-1249081 	 -1.7729120423404525
GX011-67-13082028 	 -1.773344291540572
GX065-07-1260970 	 -1.7735612504590017
GX003-25-6499972 	 -1.7737680455090052
GX018-50-13278873 	 -1.773923121406888
GX000-12-12356259 	 -1.7755695118508208
GX003-37-4143581 	 -1.7770266582165541
GX018-71-14300072 	 -1.7783632692223992
GX055-41-12201335 	 -1.778494389373083
GX032-79-10505072 	 -1.779529919992754
GX000-70-15604688 	 -1.7797855015522075
GX087-54-11365538 	 -1.78

GX253-65-8976435 	 -1.8868257464253442
GX015-51-5606998 	 -1.8920868871544652
GX076-01-3565724 	 -1.9352405048685477
GX024-89-4917139 	 -1.9379822032572238
GX024-44-15209620 	 -2.0368968782060035
GX040-91-4321091 	 -2.03784441535817
GX005-86-12932445 	 -2.067863235827604
GX008-65-10617878 	 -2.2016080253450654
GX052-26-11124662 	 -2.273564883878346
qid:  13383
GX032-97-8072526 	 -0.05949963493326088
GX011-64-12087248 	 -0.4079812538693793
GX243-10-1339331 	 -1.7465449569023273
GX015-61-8546200 	 -1.7523243396720796
GX054-28-7157401 	 -1.7793416395825863
GX058-41-6076811 	 -1.900652439216136
GX025-48-13154579 	 -1.932870595007456
GX020-08-1025049 	 -2.782018827552598
qid:  13388
GX001-51-6055813 	 -0.049770468799726866
GX038-94-5315600 	 -0.9843236830267792
GX073-42-1911121 	 -1.749291580898367
GX000-93-3761073 	 -1.7514449149765046
GX104-14-0460929 	 -1.7539015847254118
GX065-94-14784575 	 -1.7579924181253248
GX091-53-1175476 	 -1.7598267255476585
GX260-50-5018686 	 -1.7617192109988709

In [27]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold4\\train.txt")
x_vali, y_vali = createValiDataSet("MQ2008\\MQ2008\\Fold4\\vali.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold4\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
predictions = clf.predict(x_vali)
print("Accuracy on {} vali file is {}".format("Fold4", metrics.accuracy_score(y_vali, predictions)))
# apply the model on test file and print
print("For each qid in Fold4 test file, sort the documents:")
sorted_docid_print()

Accuracy on Fold4 vali file is 0.8473177441540578
For each qid in Fold4 test file, sort the documents:
qid:  14037
GX251-96-8493810 	 -1.6828368946213217
GX081-69-1048268 	 -1.6955182529774087
GX004-60-9416940 	 -1.7000883567707168
GX228-02-10524405 	 -1.7014952469051463
GX257-67-14110887 	 -1.7066762102255448
GX091-49-3018556 	 -1.7091450674207471
GX262-10-8542836 	 -1.7107606634614594
GX266-81-7615384 	 -1.7109773448324386
GX039-05-10560983 	 -1.713672614863713
GX025-90-14243696 	 -1.720908046868682
GX054-15-15832567 	 -1.7228580634275883
GX121-68-4625340 	 -1.7288155336840316
GX002-12-9743659 	 -1.7309394793625033
GX259-75-11830348 	 -1.8272116007030101
GX000-07-4584575 	 -1.998097096378467
GX062-19-7355440 	 -2.18136332122247
qid:  14043
GX256-82-12690962 	 -0.20067927015559056
GX026-71-8720902 	 -0.7162678883752459
GX041-25-6862082 	 -0.7575882345936867
GX001-90-15222563 	 -0.9171113846153643
GX026-91-0752750 	 -1.0018073166150752
GX008-01-4642180 	 -1.0463914084261052
GX003-03-14

GX041-68-3445739 	 -1.7213394973850018
GX264-65-15303409 	 -1.7310825948649216
GX015-26-5056899 	 -1.8712178015743766
GX114-86-13709886 	 -1.9704419980286818
GX006-68-2702389 	 -2.2023317065493195
GX045-88-2815345 	 -2.3492857350112484
GX241-83-15437879 	 -2.4924609560072657
GX231-80-2288373 	 -2.5473643244005277
GX216-79-9685748 	 -2.561291746655249
qid:  14724
GX037-76-5798163 	 -0.41865360544501495
GX006-23-10880593 	 -0.6282665360622591
GX001-96-7606907 	 -0.6931471805599453
GX001-57-10486467 	 -1.359113166867173
GX256-44-6357544 	 -1.3719655338424674
GX004-59-8847810 	 -1.3734099535442952
GX193-43-7717037 	 -1.4438906817202826
GX263-53-6047236 	 -1.4875808123113146
GX002-53-8753231 	 -1.5029591755141416
GX046-68-5608729 	 -1.6679786165868138
GX264-55-16328787 	 -1.695228091772176
GX000-40-8731369 	 -1.721192771874468
GX020-19-14748446 	 -1.7240508170179238
GX032-54-2890351 	 -1.7250369948844955
GX042-62-7741882 	 -1.7286449737112952
GX018-65-2222122 	 -1.735251402123464
qid:  1473

GX000-21-6111461 	 -1.7276755369840684
GX010-12-4962685 	 -1.7283829028301867
GX228-08-14477498 	 -1.7288658943976984
GX001-12-7406471 	 -1.7292573430333784
GX242-81-11911071 	 -1.730170263993106
GX183-91-16246019 	 -1.7304703878188288
GX038-22-2199858 	 -1.731742504693065
GX150-78-11540434 	 -1.7318344079098962
GX252-52-5624886 	 -1.7330134222143485
GX062-94-15297309 	 -1.7335371788463598
GX265-08-4778851 	 -1.7343144261222814
GX242-00-15486125 	 -1.7351369596840889
GX000-71-16390762 	 -1.7352852578140314
GX162-04-10229619 	 -1.7373875106200296
GX017-76-7693929 	 -1.7376105675748865
GX094-46-8355518 	 -1.7389129562001633
GX167-66-4010588 	 -1.7415325504513834
GX262-12-5058258 	 -1.7432176309033756
GX229-01-15168492 	 -1.743319818444529
GX183-18-0782880 	 -1.7439326014522638
GX181-16-2796549 	 -1.748968807438807
GX256-31-16256810 	 -1.7517352759383558
GX233-78-1844810 	 -1.7521915310453975
GX001-03-16054730 	 -1.7526932063707614
GX240-75-2163226 	 -1.7564826159728504
GX007-93-1248102 	

In [28]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold5\\train.txt")
x_vali, y_vali = createValiDataSet("MQ2008\\MQ2008\\Fold5\\vali.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold5\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
predictions = clf.predict(x_vali)
print("Accuracy on {} vali file is {}".format("Fold5", metrics.accuracy_score(y_vali, predictions)))
# apply the model on test file and print
print("For each qid in Fold5 test file, sort the documents:")
sorted_docid_print()

Accuracy on Fold5 vali file is 0.7916394513389942
For each qid in Fold5 test file, sort the documents:
qid:  15928
GX034-58-10113712 	 -1.7677830349689898
GX034-49-8740899 	 -1.773974224459306
GX242-36-9238719 	 -1.7755702684953691
GX043-30-13103572 	 -1.7761751685728688
GX015-44-4118282 	 -1.7773638641504295
GX069-07-14164433 	 -1.7791237519133536
GX257-32-13973035 	 -1.7824505933916823
GX186-72-15749240 	 -1.8050974975583955
GX052-67-5008443 	 -1.8159820114225256
GX229-00-6560972 	 -1.821593688777569
GX247-57-16060509 	 -2.21406675977271
GX033-03-4749959 	 -2.2472416023255586
GX060-74-0065456 	 -2.3456129217321022
GX074-59-6685405 	 -2.5626062902411872
GX068-98-13190287 	 -2.97211590043955
qid:  15948
GX004-98-10572354 	 -0.29603779648247164
GX007-33-14413973 	 -1.7575869683950547
GX021-89-8634221 	 -1.7644257257422598
GX007-69-7935665 	 -1.7714951113876451
GX055-65-1781471 	 -1.7877379394253565
GX008-88-0644868 	 -1.8047308614340773
GX024-24-2041354 	 -1.93793268677394
GX210-38-3668

GX066-13-2831812 	 -1.7840042442727808
GX043-04-7060348 	 -1.7842926443872067
GX256-60-2788323 	 -1.7853251790650733
GX266-69-15914843 	 -1.7856785723247413
GX006-18-8115779 	 -1.7989498768776189
GX088-39-2646370 	 -1.8003879055155498
GX084-79-15445651 	 -1.8010991561679965
GX058-51-10464771 	 -1.808044322335406
GX074-02-4092992 	 -1.8146879125116548
GX074-05-5222272 	 -1.8150673496455638
GX053-43-7140215 	 -1.81940426359578
GX260-19-7722530 	 -1.8218466262090913
GX001-93-8449800 	 -1.8307421117144564
GX026-33-2236466 	 -1.831430556826028
GX039-78-12712045 	 -1.855836350712651
GX243-48-14222592 	 -1.8985341291900117
GX074-05-7103499 	 -1.90633602137685
GX107-49-3101300 	 -1.9152741946681895
GX052-24-15187350 	 -1.9415305742911335
GX063-07-8435746 	 -1.9443379402197412
GX077-29-9367081 	 -1.972514899612603
GX019-87-13079738 	 -2.027499090809352
GX263-00-9594563 	 -2.058105111841102
GX057-86-7461750 	 -2.070603646411165
GX067-66-2790379 	 -2.0813854223297015
GX067-69-8341994 	 -2.0818836

GX225-77-2103728 	 -1.3849439902320126
GX023-65-12876845 	 -1.4211727246098103
GX069-95-13734813 	 -1.7106103871535754
GX041-31-4958011 	 -1.7828486946452153
GX076-22-7321165 	 -1.7912484752262245
GX043-61-8505439 	 -1.868738089996441
GX064-79-13110009 	 -2.0535362736362006
qid:  18087
GX000-13-13401973 	 -1.46895364942809
GX257-95-4551723 	 -1.7075829787832668
GX229-51-0008016 	 -1.7805664615123007
GX015-84-10068629 	 -1.8017481490617138
GX227-02-13195623 	 -1.802896646674895
GX114-33-5433518 	 -1.8074702274060936
GX015-19-12783938 	 -1.8129651489185854
GX000-53-13956386 	 -1.8433096569139573
GX240-99-7751826 	 -1.8605559821054922
GX089-91-15170825 	 -1.9060303635673301
GX055-35-6005648 	 -1.9577392506058975
GX175-01-14073186 	 -2.099118626034878
GX007-26-13423700 	 -2.4991172789325278
GX249-83-8648324 	 -2.532872923793658
GX024-52-13941844 	 -2.939525719241988
qid:  18105
GX236-03-13941705 	 -0.4826620683008803
GX010-60-5292335 	 -1.7105563936489065
GX000-02-1027347 	 -1.716600077216

## Part 2.2: NDCG (20 points)

Based on your prediction file (results can be ranked by the scores in the prediction file) and the ground-truth relevance labels (i.e., 0, 1, 2) in the test file, calculate NDCG for each query. Report the average NDCG over all queries in the five-fold cross validation.

For NDCG, please build your own function rather than using any package.
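
As a point of reference, the computation described above can be sketched as follows. This is a minimal illustration, not the required solution: the names `dcg_at_k`, `ndcg_at_k`, and `mean_ndcg` are hypothetical, the cutoff `k=10` is an assumption, and the gain is taken to be the raw relevance label (another common convention uses `2**rel - 1`). The same cutoff is applied to both DCG and the ideal DCG so that a perfect ranking scores exactly 1.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k labels, given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k=10):
    """NDCG@k for one query: DCG of the predicted order over DCG of the ideal order."""
    # Order the true labels by descending predicted score.
    paired = sorted(zip(scores, relevances), key=lambda p: p[0], reverse=True)
    ranked = [rel for _, rel in paired]
    # The ideal order sorts the true labels themselves.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    # A query with no relevant documents is scored 0 (a convention, assumed here).
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0


def mean_ndcg(qids, scores, relevances, k=10):
    """Average NDCG@k over all queries, grouping documents by query id."""
    per_query = {}
    for qid, score, rel in zip(qids, scores, relevances):
        entry = per_query.setdefault(qid, ([], []))
        entry[0].append(score)
        entry[1].append(rel)
    values = [ndcg_at_k(s, r, k) for s, r in per_query.values()]
    return sum(values) / len(values)
```

Here `scores` would be the per-document ranking scores from the model and `relevances` the 0/1/2 labels from the test file, both aligned with `qids`.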

In [29]:
# your code here
def NDCG():
    # Relies on the globals clf, x_test, docid_test, rel_test and qid_test set by the fold cell.
    # Score each test document by column 1 of predict_log_proba (the second entry of clf.classes_).
    log_proba = clf.predict_log_proba(x_test)
    score_list = [row[1] for row in log_proba]

    # Group [docid, score, relevance] triples by query id.
    qid_docid_score_dict = {}
    for i in range(len(score_list)):
        qid_docid_list = [docid_test[i], score_list[i], rel_test[i]]
        qid_docid_score_dict.setdefault(qid_test[i], []).append(qid_docid_list)

    sum_NDCG = 0
    for key, value in qid_docid_score_dict.items():
        # DCG: rank by predicted score and sum the discounted gains of the top 10.
        value.sort(key=lambda x: x[1], reverse=True)
        DCG = 0
        for j in range(min(len(value), 10)):
            DCG += value[j][2] / np.log2(j + 2)
        # IDCG: rank by true relevance; summed over all documents of the query (no cutoff at 10).
        value.sort(key=lambda x: x[2], reverse=True)
        IDCG = 0
        for j in range(len(value)):
            IDCG += value[j][2] / np.log2(j + 2)
        # NDCG: a query with no relevant documents contributes 0.
        if IDCG == 0:
            query_NDCG = 0
        else:
            query_NDCG = DCG / IDCG
        sum_NDCG += query_NDCG

    average_NDCG = sum_NDCG / len(qid_docid_score_dict)
    return average_NDCG

In [30]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold1\\train.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold1\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
Fold1_average_NDCG = NDCG()
print("Fold1_average_NDCG: ", Fold1_average_NDCG)

Fold1_average_NDCG:  0.431112388577871


In [31]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold2\\train.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold2\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
Fold2_average_NDCG = NDCG()
print("Fold2_average_NDCG: ", Fold2_average_NDCG)

Fold2_average_NDCG:  0.433529582537733


In [32]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold3\\train.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold3\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
Fold3_average_NDCG = NDCG()
print("Fold3_average_NDCG: ", Fold3_average_NDCG)

Fold3_average_NDCG:  0.4538636734023412


In [33]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold4\\train.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold4\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
Fold4_average_NDCG = NDCG()
print("Fold4_average_NDCG: ", Fold4_average_NDCG)

Fold4_average_NDCG:  0.5022472279652941


In [34]:
x_train, y_train = createTrainDataSet("MQ2008\\MQ2008\\Fold5\\train.txt")
x_test, y_test, rel_test, qid_test, docid_test = createTestDataSet("MQ2008\\MQ2008\\Fold5\\test.txt")

clf = SVC(gamma='auto', probability=True)
clf.fit(x_train, y_train)
Fold5_average_NDCG = NDCG()
print("Fold5_average_NDCG: ", Fold5_average_NDCG)

Fold5_average_NDCG:  0.5091206325695353


In [35]:
sum_NDCG = Fold1_average_NDCG + Fold2_average_NDCG + Fold3_average_NDCG + Fold4_average_NDCG + Fold5_average_NDCG
# Use a separate name so the NDCG() function defined above is not overwritten.
mean_NDCG = sum_NDCG / 5
print("NDCG in the whole dataset: ", mean_NDCG)

NDCG in the whole dataset:  0.4659747010105549


## (BONUS) Pairwise Learning to Rank (5 points)

Rather than using the point-wise approach as in Part 2.1, try to implement a pairwise approach.
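
One common way to set up a pairwise approach is to turn documents of the same query into preference pairs and train a binary classifier on the feature differences. The sketch below only builds those pairs. It is an illustration under stated assumptions, not a prescribed solution: `pairwise_transform` is a hypothetical helper, features are plain lists of floats, and each pair is also added in mirrored form so the two classes stay balanced.

```python
from itertools import combinations


def pairwise_transform(features, relevances, qids):
    """Build (x_i - x_j, +1/-1) training pairs from documents that share a query."""
    # Collect the document indices belonging to each query.
    by_query = {}
    for idx, qid in enumerate(qids):
        by_query.setdefault(qid, []).append(idx)

    pair_features, pair_labels = [], []
    for indices in by_query.values():
        for i, j in combinations(indices, 2):
            if relevances[i] == relevances[j]:
                continue  # equal labels express no preference, so skip the pair
            diff = [a - b for a, b in zip(features[i], features[j])]
            label = 1 if relevances[i] > relevances[j] else -1
            # Add the pair and its mirror image to keep the classes balanced.
            pair_features.append(diff)
            pair_labels.append(label)
            pair_features.append([-d for d in diff])
            pair_labels.append(-label)
    return pair_features, pair_labels
```

A linear classifier fit on these pairs yields a weight vector, and the dot product of that vector with a document's features can then serve as its ranking score within a query.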

In [8]:
# your code here

## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*