#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 2:  PageRank + Learning to Rank

### 100 points [10% of your final grade]

### Due: March 5, 2020 by 11:59pm

*Goals of this homework:* In this homework you will explore real-world challenges of building a graph (in this case, from tweets), then implement and test the classic PageRank algorithm over this graph. In addition, you will apply learning to rank to a real-world dataset and report the performance in terms of NDCG.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw2.ipynb`. For example, my homework submission would be something like `555001234_hw2.ipynb`. Submit this notebook via eCampus (look for the homework 2 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: PageRank (60 points)
In this assignment, we're going to adapt the classic PageRank approach to allow us to find not the most authoritative web pages, but rather to find significant Twitter users. 


## Part 1.1: A re-Tweet Graph (20 points)

So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their retweets of other Twitter users (so user = node, retweet of another user = edge). Over this Twitter-user graph, we can apply the PageRank approach to order the users. The main idea is that a user who is retweeted by other users is more "impactful". 

Here is a toy example. Suppose you are given the following four retweets:

* **userID**: diane, **text**: "RT ", **sourceID**: bob
* **userID**: charlie, **text**: "RT Welcome", **sourceID**: alice
* **userID**: bob, **text**: "RT Hi ", **sourceID**: diane
* **userID**: alice, **text**: "RT Howdy!", **sourceID**: parisa

There are four short tweets retweeted by four users. The retweets between users form a directed graph with five nodes and four edges. For example, the "diane" node has a directed edge to the "bob" node.

You should build a graph by parsing the tweets in the file we provide called *PageRank.json*.

**Notes:**

* You may see some weird characters in the content of tweets; just ignore them.
* The edges are binary and directed. If Bob retweets Alice once, in 10 tweets, or 10 times in one tweet, there is an edge from Bob to Alice, but there is not an edge from Alice to Bob.
* If a user retweets herself, ignore it.
* Correctly parsing screen_name in a tweet is error-prone. Use the id of the user (this is the user who is re-tweeting) and the id of the user in the retweeted_status field (this is the user who is being re-tweeted; that is, this user created the original tweet).
* Later you will need to implement the PageRank algorithm on the graph you build here.
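As a quick check of the notes above, the toy example can be turned into an edge set by hand. This is a minimal sketch, assuming each retweet has already been reduced to a (retweeter, source) pair; `build_edges` and the extra duplicate and self-retweet rows are illustrative and not part of the provided data.

```python
def build_edges(retweets):
    """Return (nodes, edges) for a list of (retweeter, source) pairs.

    Edges are binary and directed; self-retweets are skipped.
    """
    edges = set()
    for retweeter, source in retweets:
        if retweeter == source:
            continue  # a user retweeting herself adds no edge
        edges.add((retweeter, source))  # the set collapses repeated retweets
    nodes = sorted({user for edge in edges for user in edge})
    return nodes, edges


toy_retweets = [
    ("diane", "bob"),
    ("charlie", "alice"),
    ("bob", "diane"),
    ("alice", "parisa"),
    ("alice", "parisa"),  # hypothetical duplicate retweet: still one edge
    ("bob", "bob"),       # hypothetical self-retweet: ignored
]

nodes, edges = build_edges(toy_retweets)
print(len(nodes), "nodes,", len(edges), "edges")
```

Storing edges in a set keeps them binary without any explicit duplicate check, which matters once the same pair of users appears in many tweets.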


In [1]:
# Here define your function for building the graph by parsing 
# the input file of tweets
# Insert as many cells as you want

In [48]:
# Here define your function for building the graph 
# by parsing the input file 
# Insert as many cells as you want
import numpy as np
import json
import itertools
import operator
import csv
import pandas as pd
from sklearn.svm import SVC
import warnings
from collections import defaultdict
import math
from statistics import mean 

In [49]:
def FetchData(filename):
    f = open(filename, encoding = "utf8")
            
    return f

In [50]:
#Build Graph Function to build a graph having tweet connections

f = FetchData('HITS.json')

def buildGraph():
    # Each input line is one JSON-encoded retweet; collect [retweeter id, source id] pairs.
    twitterRelation = []
    for item in f:
        data = json.loads(item)
        retweeter = data.get('user').get('id')
        source = data.get('retweeted_status').get('user').get('id')
        twitterRelation.append([retweeter, source])

    matrix = np.array(twitterRelation)
    nodes = np.unique(matrix)                        # sorted unique user ids
    nodeindex = {n: i for i, n in enumerate(nodes)}  # user id -> row/column index
    n = nodes.size
    graph = np.zeros((n, n))
    numdata = np.vectorize(nodeindex.get)(matrix)
    seen = set()  # distinct (tail, head) pairs, so repeated retweets count once
    for t, h in numdata:
        seen.add((t, h))
        graph[t, h] = 1  # binary edge: user t retweeted user h

    return graph, n, len(seen), nodes

In [51]:
graph,n,edges, nodes = buildGraph()

In [52]:
def printSizeValues():
    print("{} : {}".format("Total number of nodes in the graph are", n))
    print("{} : {}".format("Total number of edges in the graph are", edges))

In [53]:
printSizeValues()

Total number of nodes in the graph are : 1003
Total number of edges in the graph are : 6177


In [11]:
# Call your function to print out the size of the graph, 
# i.e., the number of nodes and edges
# How you maintain the graph is totally up to you
# However, if you encounter any memory issues, we recommend you 
# write the graph into a file, and load it later.

We will not check the correctness of your graph. However, this will affect the PageRank results later.

## Part 1.2: PageRank Implementation (30 points)

Your program will return the top 10 users with highest PageRank scores. The **output** should be like:

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal probability.
* The probability of the random surfer teleporting is 0.1 (that is, the damping factor is 0.9).
* If a user is never retweeted and does not retweet anyone, their PageRank score should be zero. Do not include the user in the calculation.
* It is up to you to decide when to terminate the PageRank calculation.
* There are PageRank implementations out there on the web. Remember, your code should be **your own**.


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment includes their latest versions.
* Test your parsing and PageRank calculations using a handful of tweets, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.
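To make the rules above concrete, here is a minimal power-iteration sketch on the toy graph from Part 1.1. It is not a reference solution: the function name `pagerank`, the dict-of-sets graph layout, the uniform treatment of users who retweet no one, and the tolerance `tol` are all assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.9, tol=1e-6, max_iter=1000):
    """Power-iteration PageRank over a dict {node: set of nodes it points to}."""
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # all nodes start with equal probability
    for iteration in range(1, max_iter + 1):
        # Teleport share, plus the rank of dangling nodes spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            targets = out_links.get(u, ())
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)  # L1 change
        rank = new_rank
        if delta < tol:
            break  # ranks have stopped moving
    return rank, iteration


toy = {"diane": {"bob"}, "bob": {"diane"}, "charlie": {"alice"}, "alice": {"parisa"}}
rank, iterations = pagerank(toy)
for user, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user} - {score:.4f}")
```

A dict of sets keeps memory proportional to the number of edges, whereas a dense matrix grows with the square of the number of users.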

What is the termination condition in your PageRank implementation? Describe it below:

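The power iteration stops once the L1 distance between two successive rank vectors (the sum of the absolute per-node changes) is no greater than `maxerr = 1e-6`. On this graph the loop stops after 38 iterations (see the output below).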

In [162]:
# Here add your code to implement a function called PageRanker
# Insert as many cells as you want

# Rows with no out-links divide by zero below; silence the warning and patch them.
np.seterr(divide='ignore', invalid='ignore')
def PageRanker(C, nodes, maxerr=.000001):
    n = C.shape[0]
    matrix = np.zeros((n, n))
    ranks = np.ones((n, 1)) / n  # every node starts with equal probability
    n_iterations = 0
    delta = 1.0
    newRank = np.zeros((n, 1))
    for i in range(n):
        # Row-normalise so each user's out-links share that user's rank equally.
        matrix[i] = np.divide(C[i], sum(C[i]))
        if np.any(np.isnan(matrix[i])):
            # Dangling user (retweets no one): jump to every node uniformly.
            matrix[i] = np.add(C[i], 1 / len(C[i]))

    new_matrix = matrix.transpose()
    teleport = np.ones((n, n))
    # Damping factor 0.9, teleport probability 0.1.
    H = new_matrix * 0.9 + teleport * ((1 - 0.9) / n)

    # Power iteration: stop once the L1 change between iterations is at most maxerr.
    while delta > maxerr:
        newRank = H.dot(ranks)
        n_iterations += 1
        delta = sum(abs(newRank[node] - ranks[node]) for node in range(n))
        ranks = newRank

    # Pair each user id with its score and keep the ten highest.
    dicts = {nodes[i]: newRank[i][0] for i in range(len(nodes))}
    sorted_d = sorted(dicts.items(), key=operator.itemgetter(1), reverse=True)

    return sorted_d[:10], n_iterations
        

In [163]:
# Now let's call your function on the graph you've built. Output the results.
pageranks, iterations  = PageRanker(graph,nodes)


In [164]:
print("PAGERANK OUTPUT:")
print("    PAGE          USER RANK")
for i in pageranks:
    print(i)
print()    
print("TOTAL ITERATIONS TO COMPUTE PAGERANK VALUES:" , iterations)


PAGERANK OUTPUT:
    PAGE          USER RANK
(1183906148, 0.02842143948679257)
(3019659587, 0.021570050798506712)
(3077695572, 0.02093852100565467)
(3068694151, 0.01850008734487332)
(2598548166, 0.017636724442521616)
(3154266823, 0.01738327797903532)
(571198546, 0.01729265299623784)
(3042570996, 0.017220262192581017)
(3039321886, 0.015622627381174118)
(3082766914, 0.014498603525010737)

TOTAL ITERATIONS TO COMPUTE PAGERANK VALUES: 38


## Part 1.3: Improving PageRank (10 points)
In the many years since PageRank was introduced, there have been many improvements and extensions. For this part, you should experiment with one such improvement and then compare the results you get with the original results in Part 1.2. 

In [65]:
# Here add your code
np.seterr(divide='ignore', invalid='ignore')
def ImprovedPageRanker1(C,nodes, maxerr = .000001):
    
    n = C.shape[0]
    matrix = np.zeros((n, n))            
    initial_value = 1 / n
    ranks =np.ones((n,1)).dot(initial_value) # initial matrix(1003 x 1)   
    n_iterations = 0
    delta = 1.0
    newRank= np.zeros((n,1))
    for i in range(0,n):
        matrix[i]=np.divide(C[i],sum(C[i]))
        if np.any(np.isnan(matrix[i])) == True :
            matrix[i]=np.add(C[i],(1/len(C[i])))
            
    new_matrix = matrix.transpose()
    teleport = np.ones((n,n))
    H = new_matrix*(1-0.9) + teleport*((0.9)/n)
        
        
    while delta > maxerr:
        
        newRank = H.dot(ranks)
        
        mean_value = sum(newRank)/len(newRank)
        newr = newRank/(math.sqrt(mean_value))
        n_iterations += 1
        delta = sum(abs(newRank[node] - ranks[node]) for node in range(0,n))
        ranks = newr
   
    out=[]
    for j in range(len(newRank)):
        a = newRank[j]
        out.append(a[0])
    
  
    dicts = {}
    keys = nodes
    values = out
    
    for i in range(len(keys)):
        dicts[keys[i]] = values[i]    
    
    
    sorted_d = sorted(dicts.items(), key=operator.itemgetter(1), reverse=True)
   
    return sorted_d[:10], n_iterations
 
    # This is the revised algorithm: we take the mean and normalise the values.
    # Iterations have reduced and the top pages differ after changing the H matrix.

In [66]:
Ipageranks, Iiterations  = ImprovedPageRanker1(graph,nodes)

In [100]:
print("IMPROVED PAGERANK OUTPUT 1:")
print("    PAGE          USER RANK")
for i in Ipageranks:
    print(i)
print()    
print("TOTAL ITERATIONS TO COMPUTE PAGERANK VALUES:" , Iiterations)


IMPROVED PAGERANK OUTPUT 1:
    PAGE          USER RANK
(3039321886, 3.3097731245742077)
(571198546, 3.28615128332632)
(1183906148, 3.2079643141947143)
(1638625987, 3.1854343483904914)
(3042570996, 2.9523969456686885)
(3019659587, 2.856835326849231)
(3077695572, 2.833448377570785)
(3154266823, 2.741560176947009)
(1358345766, 2.6515802801212423)
(3068694151, 2.4004042693293273)

TOTAL ITERATIONS TO COMPUTE PAGERANK VALUES: 8


In [None]:
'''
IMPROVED PAGERANK 1 

Source : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.3964&rep=rep1&type=pdf

In this improved version, I used the following formula from the paper cited above to build the H matrix:
 M' = (1-D)*M + D*[1/N]nxn, where D is the damping factor and N is the total number of nodes.
 With D = 0.9 the link matrix M gets weight 0.1 and the uniform teleport matrix gets weight 0.9,
 the reverse of the weighting used in Part 1.2.
 
 Additionally, inside the while loop, I take the mean of the newly computed ranks and divide the new
 rank vector by the square root of that mean before it becomes the rank vector for the next iteration.
 
 The following changes were observed:
 1) The number of iterations needed to converge dropped from 38 to 8.
 2) The top-10 users and their order changed (for example, 3039321886 moved from 9th to 1st), and the
    scores are no longer probabilities that sum to 1.

'''

In [93]:
from scipy.sparse.linalg import eigs
np.seterr(divide='ignore', invalid='ignore')
def ImprovedPageRanker2(C,nodes, maxerr = .000001):
    n = C.shape[0] #5
    matrix = np.zeros((n, n))            
    initial_value = 1 / n
    ranks =np.ones((n,1)).dot(initial_value) # initial matrix(1003 x 1)   
    n_iterations = 0
    delta = 1.0
    newRank= np.zeros((n,1))
    for i in range(0,n):
        matrix[i]=np.divide(C[i],sum(C[i]))
        if np.any(np.isnan(matrix[i])) == True :
            matrix[i]=np.add(C[i],(1/len(C[i])))
            
    new_matrix = matrix.transpose()
    teleport = np.ones((n,n))
    H = new_matrix*(0.9) + teleport*((1-0.9)/n)
   
    
    vals, vecs = eigs(np.array(new_matrix), k=1)
    
   
    eigenvector = []
    for vec in vecs:
        eigenvector.append(vec[0])

    
    dicts = {}
    keys = nodes
    values = eigenvector
    
    for i in range(len(keys)):
        dicts[keys[i]] = values[i]    
    
    
    sorted_d = sorted(dicts.items(), key=operator.itemgetter(1), reverse=True)
   
    return sorted_d[:10], n_iterations
 

In [96]:
I2pageranks, I2iterations = ImprovedPageRanker2(graph,nodes)

In [99]:
print("IMPROVED PAGERANK OUTPUT 2:")
print("    PAGE          USER RANK")
for i in I2pageranks:
    print(i)
print()    


PAGERANK OUTPUT:
    PAGE          USER RANK
(1183906148, (0.372988137447679+0j))
(2598548166, (0.23940538393235555+0j))
(3019659587, (0.23343298622775943+0j))
(3077695572, (0.22922291560492608+0j))
(3154266823, (0.20421683088770742+0j))
(3068694151, (0.2032681066484863+0j))
(3042570996, (0.2032485430333665+0j))
(3264645911, (0.1812099104137282+0j))
(3082766914, (0.1792897095035943+0j))
(571198546, (0.1719584073564121+0j))



In [6]:
# Plus be sure to describe your extension (what is it? 
# why did you choose it?) and your comparison to Part 1.2
'''
IMPROVED PAGERANK 2

In this improved version, I computed the normalised transition matrix from the graph and passed it to the
eigs function to obtain its leading eigenvector. The entries of that eigenvector are used as the PageRank scores.

 The following changes were observed:
 1) No power-iteration loop is needed in my code, because eigs computes the eigenvector directly.
 2) The top-10 list largely overlaps with Part 1.2 (nine of the ten users are the same), but the order
    and the score values differ.

'''

# Part 2: Learning to Rank (40 points)

For this part, we're going to play with some Microsoft LETOR data that has query-document relevance judgments. Let's see how learning to rank works in practice. 

First, you will need to download the MQ2008.zip file from the Resources tab on Piazza. This is data from the [Microsoft Research IR Group](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/).

The data includes 15,211 rows. Each row is a query-document pair. The first column is the relevance label of the pair (0, 1, or 2; the higher the value, the more relevant), the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. A query-document pair is represented by a 46-dimensional feature vector. Each feature is a numeric value describing the document and query, such as TF-IDF, BM25, or PageRank. You can find a complete description of the features [here](https://arxiv.org/ftp/arxiv/papers/1306/1306.2597.pdf).
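The row layout described above can be read with a few string splits. The sketch below is illustrative only: `parse_letor_line` is a hypothetical helper, and the sample row is made up and shortened to three features instead of 46.

```python
def parse_letor_line(line):
    """Split one LETOR-style row into (label, qid, features, comment)."""
    body, _, comment = line.partition("#")  # everything after '#' is the comment
    tokens = body.split()
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features, comment.strip()


# Made-up row with only three features, for illustration.
sample = "2 qid:10032 1:0.056537 2:0.000000 3:0.666667 #docid = GX029-35-5894638"
label, qid, features, comment = parse_letor_line(sample)
print(label, qid, features[3], comment)
```

Keying the features by their index rather than by position keeps the parser correct even if a row omits a feature.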

The good news for you is the dataset is ready for analysis: It has already been split into 5 folds (see the five folders called Fold1, ..., Fold5).


## Part 2.1: Build Point-wise Learning to Rank  (20 points)
First, you should build a point-wise Learning to Rank framework. 
1. You could train a binary classification model like SVM or logistic regression on the train file. In this case, 0 is treated as a negative (irrelevant) sample, and 1 and 2 are treated as positive (relevant) samples.
2. Apply the trained model to predict scores for the documents in the test file.
3. Order the documents based on the scores.
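The label binarisation in step 1 and the per-query ordering in step 3 can be sketched without any model. In the example below the scores are made-up numbers standing in for classifier outputs, and `rank_by_query` is a hypothetical helper name.

```python
from collections import defaultdict


def rank_by_query(qids, scores, labels):
    """Group rows by query id and order each group by descending score."""
    per_query = defaultdict(list)
    for qid, score, label in zip(qids, scores, labels):
        per_query[qid].append((score, label))
    return {
        qid: [label for _, label in sorted(rows, key=lambda row: row[0], reverse=True)]
        for qid, rows in per_query.items()
    }


qids = ["q1", "q1", "q1", "q2", "q2"]
scores = [0.2, 0.9, 0.5, 0.1, 0.7]  # made-up model scores
labels = [0, 2, 1, 1, 0]            # graded ground truth

# Step 1: graded labels collapse to a binary training target.
binary = [1 if label > 0 else 0 for label in labels]

# Step 3: documents are ordered per query, never across queries.
ranked = rank_by_query(qids, scores, labels)
print(binary)
print(ranked)
```

The graded labels are kept alongside the scores because they are still needed to evaluate the ranking, even though the classifier only sees the binary target.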

add your results and discussion here

In [101]:
#define file names

#Train files
train_txt_1 = "train1.txt"
train_csv_1 = "trainc1.csv"
train_txt_2 = "train2.txt"
train_csv_2 = "trainc2.csv"
train_txt_3 = "train3.txt"
train_csv_3 = "trainc3.csv"
train_txt_4 = "train4.txt"
train_csv_4 = "trainc4.csv"
train_txt_5 = "train5.txt"
train_csv_5 = "trainc5.csv"

#Test Files
test_txt_1 = "test1.txt"
test_csv_1 = "testc1.csv"
test_txt_2 = "test2.txt"
test_csv_2 = "testc2.csv"
test_txt_3 = "test3.txt"
test_csv_3 = "testc3.csv"
test_txt_4 = "test4.txt"
test_csv_4 = "testc4.csv"
test_txt_5 = "test5.txt"
test_csv_5 = "testc5.csv"

#Validation Files
vali_txt_1 = "vali1.txt"
vali_csv_1 = "valic1.csv"
vali_txt_2 = "vali2.txt"
vali_csv_2 = "valic2.csv"
vali_txt_3 = "vali3.txt"
vali_csv_3 = "valic3.csv"
vali_txt_4 = "vali4.txt"
vali_csv_4 = "valic4.csv"
vali_txt_5 = "vali5.txt"
vali_csv_5 = "valic5.csv"

In [102]:
# your code here
def parse_data(inputf, outputf):
    qid = []
    docid = []
    original_labels = []
    header = list(range(1, 47)) + ['label']  # 46 feature columns, then the binary label

    with open(outputf, 'w+', newline='') as csvFile, open(inputf, "r") as infile:
        writer = csv.writer(csvFile)
        writer.writerow(header)

        for line in infile:
            rowval = []
            values = line.split()
            r = 2  # values[0] is the label, values[1] the qid; features start at index 2

            for i in range(1, 47):
                index, value = values[r].split(":")
                if int(index) != i:
                    rowval.append("")  # feature i is missing from this row
                    continue
                rowval.append(value)
                r += 1

            temp = int(values[0])
            original_labels.append(temp)

            # Collapse graded relevance (1, 2) into a single positive class.
            if temp in (1, 2):
                temp = 1

            rowval.append(temp)
            qid.append(int(values[1].split(":")[1]))
            docid.append(values[50])  # the token after "#docid ="
            writer.writerow(rowval)

    return qid, docid, original_labels

In [103]:
#Fetch Query Ids and DocIds

qid1 , docid1 , org1 = parse_data(train_txt_1,train_csv_1)
qid2 , docid2 , org2  = parse_data(train_txt_2,train_csv_2)
qid3 , docid3,  org3 = parse_data(train_txt_3,train_csv_3)
qid4 , docid4,  org4 = parse_data(train_txt_4,train_csv_4)
qid5 , docid5,  org5 = parse_data(train_txt_5,train_csv_5)

tqid1 , tdocid1, torg1 = parse_data(test_txt_1,test_csv_1)
tqid2 , tdocid2,torg2 = parse_data(test_txt_2,test_csv_2)
tqid3 , tdocid3,torg3 = parse_data(test_txt_3,test_csv_3)
tqid4 , tdocid4,torg4 = parse_data(test_txt_4,test_csv_4)
tqid5 , tdocid5,torg5 = parse_data(test_txt_5,test_csv_5)

vqid1 , vdocid1, vorg1= parse_data(vali_txt_1,vali_csv_1)
vqid2 , vdocid2 ,vorg2= parse_data(vali_txt_2,vali_csv_2)
vqid3 , vdocid3 ,vorg3= parse_data(vali_txt_3,vali_csv_3)
vqid4 , vdocid4 ,vorg4= parse_data(vali_txt_4,vali_csv_4)
vqid5 , vdocid5 ,vorg5= parse_data(vali_txt_5,vali_csv_5)


In [104]:
label = 'label'
features = [str(i) for i in range(1, 47)]  # feature column names '1' .. '46'

In [105]:
def createDataFrame(filename):
    dframe = pd.read_csv(filename)
    X_train = dframe[features]
    Y_train = dframe[[label]]
    return X_train, Y_train

In [106]:
# Fetch DataFrame format Value
X_train1, Y_train1 = createDataFrame("trainc1.csv")
X_train2, Y_train2 = createDataFrame("trainc2.csv")
X_train3, Y_train3 = createDataFrame("trainc3.csv")
X_train4, Y_train4 = createDataFrame("trainc4.csv")
X_train5, Y_train5 = createDataFrame("trainc5.csv")

X_test1, Y_test1 = createDataFrame("testc1.csv")
X_test2, Y_test2 = createDataFrame("testc2.csv")
X_test3, Y_test3 = createDataFrame("testc3.csv")
X_test4, Y_test4 = createDataFrame("testc4.csv")
X_test5, Y_test5 = createDataFrame("testc5.csv")

X_vali1, Y_vali1 = createDataFrame("valic1.csv")
X_vali2, Y_vali2 = createDataFrame("valic2.csv")
X_vali3, Y_vali3 = createDataFrame("valic3.csv")
X_vali4, Y_vali4 = createDataFrame("valic4.csv")
X_vali5, Y_vali5 = createDataFrame("valic5.csv")


In [107]:
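warnings.simplefilter('ignore')
def calculateC(xlabel, ylabel, xval, yval):
    # Use a fresh dict per call so each fold keeps its own validation scores.
    cValues = {}
    for i in range(1, 10):
        svm = SVC(C=i, probability=True)
        svm.fit(xlabel, ylabel)
        cValues[i] = svm.score(xval, yval)  # validation accuracy for C = i
    return cValues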


In [108]:
cscore1 = calculateC(X_train1,Y_train1,X_vali1,Y_vali1)
cscore2 = calculateC(X_train2,Y_train2,X_vali2,Y_vali2)
cscore3 = calculateC(X_train3,Y_train3,X_vali3,Y_vali3)
cscore4 = calculateC(X_train4,Y_train4,X_vali4,Y_vali4)
cscore5 = calculateC(X_train5,Y_train5,X_vali5,Y_vali5)



In [110]:
def findmaxaccuracy(cscore):
    # Return the C value with the highest validation accuracy.
    return max(cscore, key=cscore.get)

In [113]:
max_1 = findmaxaccuracy(cscore1)
max_2 = findmaxaccuracy(cscore2)
max_3 = findmaxaccuracy(cscore3)
max_4 = findmaxaccuracy(cscore4)
max_5 = findmaxaccuracy(cscore5)
highest_c_value = max(max_1, max_2, max_3, max_4, max_5)

print(highest_c_value)

9

In [313]:
def predictionprobability(highest_c_value,xlabel,ylabel,xtest):
    svm = SVC(C = highest_c_value, kernel = 'rbf',probability=True)
    svm.fit(xlabel, ylabel)
    prediction = svm.predict_log_proba(xtest)
    return prediction

In [314]:
#Prediction Values

pred1 = predictionprobability(highest_c_value,X_train1, Y_train1,X_test1)
pred2 = predictionprobability(highest_c_value,X_train2, Y_train2,X_test2)
pred3 = predictionprobability(highest_c_value,X_train3, Y_train3,X_test3)
pred4 = predictionprobability(highest_c_value,X_train4, Y_train4,X_test4)
pred5 = predictionprobability(highest_c_value,X_train5, Y_train5,X_test5)

In [315]:
def querymapping(qid, torg, pred):
    # Map each query id to a list of [score for the positive class, graded label].
    qdict = defaultdict(list)
    for i, queryid in enumerate(qid):
        qdict[queryid].append([pred[i][1], torg[i]])
    return qdict
    

In [316]:

querymapping1 = querymapping(tqid1,torg1,pred1)
querymapping2 = querymapping(tqid2,torg2,pred2)
querymapping3 = querymapping(tqid3,torg3,pred3)
querymapping4 = querymapping(tqid4,torg4,pred4)
querymapping5 = querymapping(tqid5,torg5,pred5) 


## Part 2.2: NDCG (20 points)

Based on your prediction file (results could be ranked by scores in the prediction file) and ground-truth (i.e., 0,1,2) in the test file, calculate NDCG for each query. Report average NDCG for all queries in the five-fold cross validation.

For NDCG, please build your own function rather than using any package.
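As a small worked illustration of the metric, the sketch below computes NDCG at a cutoff for a single query, given the graded labels in predicted order. The gain `2**rel - 1`, the base-2 logarithmic discount, and the cutoff `k=10` are common choices assumed here rather than requirements stated in the assignment.

```python
import math


def dcg_at_k(labels, k=10):
    """Discounted cumulative gain of graded labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))


def ndcg_at_k(ranked_labels, k=10):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(ranked_labels, k) / ideal


perfect = ndcg_at_k([2, 1, 0])   # already in ideal order
reversed_ = ndcg_at_k([0, 1, 2])  # worst possible order of the same labels
print(perfect, reversed_)
```

Because the same discount appears in both numerator and denominator, switching to a different logarithm base leaves the ratio unchanged.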

In [317]:
def NDCG(prob_rel, original_rel):
    # Both arguments are lists of [score, label]; only the first 10 positions count.
    DCG = 0
    IDCG = 0
    for i in range(min(10, len(prob_rel))):
        discount = math.log(i + 2)  # the log base cancels in the DCG/IDCG ratio
        DCG += ((2 ** prob_rel[i][1]) - 1) / discount
        IDCG += ((2 ** original_rel[i][1]) - 1) / discount
    if IDCG == 0:
        return 0
    return DCG / IDCG

In [318]:
def meanValNDCG(querymapping):
    ndcg_val = []
    for i in querymapping:
        # Predicted order: highest score first.
        sortby_prob = sorted(querymapping[i], reverse=True)
        # Ideal order: highest ground-truth label first.
        sortby_relevance = sorted(querymapping[i], key=lambda x: x[1], reverse=True)
        ndcg_val.append(NDCG(sortby_prob, sortby_relevance))
    return mean(ndcg_val)
    

In [319]:
# your code here
ndcg1 = meanValNDCG(querymapping1)
ndcg2 = meanValNDCG(querymapping2)
ndcg3 = meanValNDCG(querymapping3)
ndcg4 = meanValNDCG(querymapping4)
ndcg5 = meanValNDCG(querymapping5)
final_ndcg = (ndcg1+ ndcg2 + ndcg3 + ndcg4 + ndcg5)/5


In [320]:
def printNDCGValues():
    print("{} : {}".format("NDCG VALUE FOR FOLD 1", ndcg1))
    print("{} : {}".format("NDCG VALUE FOR FOLD 2", ndcg2))
    print("{} : {}".format("NDCG VALUE FOR FOLD 3", ndcg3))
    print("{} : {}".format("NDCG VALUE FOR FOLD 4", ndcg4))
    print("{} : {}".format("NDCG VALUE FOR FOLD 5", ndcg5))
    print()
    print("{} : {}".format("AVERAGE NDCG VALUE FOR ALL FOLDS", final_ndcg))

In [322]:
printNDCGValues()

NDCG VALUE FOR FOLD 1 : 0.44504962468160053
NDCG VALUE FOR FOLD 2 : 0.436514126463978
NDCG VALUE FOR FOLD 3 : 0.4546122672042186
NDCG VALUE FOR FOLD 4 : 0.507081931367985
NDCG VALUE FOR FOLD 5 : 0.5118200007055816

AVERAGE NDCG VALUE FOR ALL FOLDS : 0.47101559008467275


## (BONUS) Pairwise Learning to Rank (5 points)

Rather than use the point-wise approach as in Part 2.1, instead try to implement a pairwise approach.
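One common way to set up a pairwise approach is to turn pairs of documents from the same query into difference vectors with a sign that says which document is more relevant. The sketch below shows only that transformation; `pairwise_transform` is a hypothetical helper and the feature values are made up.

```python
from itertools import combinations


def pairwise_transform(features, labels, qids):
    """Build (difference vector, sign) training pairs within each query."""
    diffs, signs = [], []
    for i, j in combinations(range(len(labels)), 2):
        if qids[i] != qids[j] or labels[i] == labels[j]:
            continue  # only compare documents of one query with different labels
        diffs.append([a - b for a, b in zip(features[i], features[j])])
        signs.append(1 if labels[i] > labels[j] else -1)
    return diffs, signs


features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.2]]  # made-up values
labels = [1, 2, 0, 1]
qids = ["q1", "q1", "q1", "q2"]

diffs, signs = pairwise_transform(features, labels, qids)
print(diffs)
print(signs)
```

A linear classifier trained on these differences yields a weight vector whose dot product with a single document's features can then serve as its ranking score. The loop over all index pairs is quadratic in the number of rows, so a real run would group the rows by query first.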

In [478]:
def pairwiseX(xtrain):
    X = []
    X.append(xtrain['1'])
    for j in range(1,46):
        row1 = str(j)
        row2 = str(j+1)
        X.append(xtrain[row1] - xtrain[row2])
    X.append(xtrain['46'])
    Dict = {} 
    for k in range(len(X[1])):  #9630#
        Dict[1] = X[1][k]

    for i in range(2,46):  # 46
        X_new = []
        for q in range(len(X[i])):  #9630
            X_new.append(X[i][q])
        
        #print(len(X_new))  #9630
        Dict[i] = X_new
    
    for f in range(len(X[46])):  #9630
        Dict[46] = X[46][f]
    return Dict    

In [481]:
data1 = pairwiseX(X_train1)
data2 = pairwiseX(X_train2)
data3 = pairwiseX(X_train3)
data4 = pairwiseX(X_train4)
data5 = pairwiseX(X_train5)

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
df4 = pd.DataFrame(data4)
df5 = pd.DataFrame(data5)



In [485]:
def pairwiseY(ytrain):
    k = 0
    yp, diff = [], []

    diff.append(ytrain['label'][0]) 
    yp.append(np.sign(diff[-1]))    
    for i in range(0,len(ytrain['label'])-1):
        diff.append(ytrain['label'][i] - ytrain['label'][i+1])
    
        yp.append(np.sign(diff[-1]))
    if yp[-1] != (-1) ** k:
        yp[-1] *= -1
   
        diff[-1] *= -1
       
        k += 1
    yp, diff = map(np.asanyarray, (yp, diff))
    
    return yp
    

In [489]:
yp1 = pairwiseY(Y_train1)
yp2 = pairwiseY(Y_train2)
yp3 = pairwiseY(Y_train3)
yp4 = pairwiseY(Y_train4)
yp5 = pairwiseY(Y_train5)

In [490]:
pred1_new = predictionprobability(10,df1,yp1,X_test1)
pred2_new = predictionprobability(10,df2,yp2,X_test2)
pred3_new = predictionprobability(10,df3,yp3,X_test3)
pred4_new = predictionprobability(10,df4,yp4,X_test4)
pred5_new = predictionprobability(10,df5,yp5,X_test5)


## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*

In [None]:
Part 1.1, 1.2, 1.3: Implemented on my own after understanding the concepts.
Part 2.1: Discussed with a fellow classmate and implemented the code on my own.
Part 2.2: Discussed the formula to be applied with a classmate.