# Learning to Rank 

For this part, we're going to play with some Microsoft LETOR data that has query-document relevance judgments. Let's see how learning to rank works in practice. 

First, we will need to download the MQ2008.zip file from the Resources tab on Piazza. This is data from the [Microsoft Research IR Group](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/).

The data includes 15,211 rows. Each row is a query-document pair. The first column is the relevance label of the pair (0, 1 or 2; the higher the value, the more relevant the document), the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. A query-document pair is represented by a 46-dimensional feature vector. Each feature is a numeric value describing a document and query, such as TF-IDF, BM25 or PageRank.
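
To make the row layout concrete, here is a minimal sketch of a parser for one line. The names `LetorRow` and `parse_letor_line` are hypothetical helpers introduced only for illustration, and the sample line is a real row shortened to three features for brevity.

```python
from typing import Dict, NamedTuple


class LetorRow(NamedTuple):
    label: int                  # relevance label: 0, 1 or 2
    qid: str                    # query id, without the "qid:" prefix
    features: Dict[int, float]  # feature number -> value
    comment: str                # trailing comment, including the docid


def parse_letor_line(line: str) -> LetorRow:
    """Split one LETOR line into label, query id, features and comment."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return LetorRow(label, qid, features, comment.strip())


# A row shortened to three features (the real rows carry 46).
sample = ("2 qid:11909 1:0.252050 2:1.000000 3:1.000000 "
          "#docid = GX062-53-0946803 inc = 0.00794 prob = 0.205853")
row = parse_letor_line(sample)
print(row.label, row.qid, row.features, row.comment)
```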

The dataset is ready for analysis: It has already been split into 5 folds (see the five folders called Fold1, ..., Fold5).

We will use SVM-rank, the basic ranking SVM. We'll see that SVM-rank considers pairwise relevance between docs: based on the training data, it transforms the data into pairs (like D1 > D2) and then learns a separator.
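
The pairwise transformation can be sketched as follows. This only illustrates the idea and is not SVM-rank's actual implementation; `preference_pairs` is a hypothetical helper and the labels are made up for the example.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def preference_pairs(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (better, worse) index pairs for the documents of ONE query."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
        # Equal labels express no preference, so they produce no pair.
    return pairs


# Hypothetical relevance labels for five documents of a single query.
labels = [0, 0, 0, 1, 2]
pairs = preference_pairs(labels)
print(len(pairs), pairs)
```

Pairs are only formed within a query, since relevance labels from different queries are not comparable.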


##  Optimizing SVM-Rank 

Explore how the different parameters affect the quality of the Ranking SVM. We'll see that we can vary the kernel function, the loss function and so forth. 

We should run SVM-Rank using the default options over each of the five folds. We should find the error on the test set (for example, depending on our settings, svm_rank_classify will give the zero/one error statistics, that is, the number of correct pairs and the number of incorrect pairs). Report the average. 
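
A minimal sketch of how the per-fold runs could be scripted. It assumes that the `svm_rank_learn` and `svm_rank_classify` binaries are on the PATH, that each fold folder contains `train.txt` and `test.txt`, and that models and predictions go into a hypothetical `runs` directory. The function names are illustrative.

```python
import subprocess
from pathlib import Path
from typing import List


def fold_commands(data_root: str, fold: int, c: float, work_dir: str = "runs") -> List[List[str]]:
    """Build the learn and classify commands for one fold (paths are assumptions)."""
    fold_dir = Path(data_root) / f"Fold{fold}"
    model = Path(work_dir) / f"model_fold{fold}.dat"
    predictions = Path(work_dir) / f"predictions_fold{fold}.txt"
    learn = ["svm_rank_learn", "-c", str(c), str(fold_dir / "train.txt"), str(model)]
    classify = ["svm_rank_classify", str(fold_dir / "test.txt"), str(model), str(predictions)]
    return [learn, classify]


def run_fold(data_root: str, fold: int, c: float, work_dir: str = "runs") -> str:
    """Train and test on one fold; return what svm_rank_classify printed."""
    Path(work_dir).mkdir(exist_ok=True)
    learn, classify = fold_commands(data_root, fold, c, work_dir)
    subprocess.run(learn, check=True)
    result = subprocess.run(classify, check=True, capture_output=True, text=True)
    return result.stdout


# Print the commands for all five folds without executing them.
for fold in range(1, 6):
    for command in fold_commands("MQ2008", fold, c=20):
        print(" ".join(command))
```

Calling `run_fold("MQ2008", 1, c=20)` would execute both steps for Fold1, provided the binaries are installed.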

Then try different parameters and report how they impact the quality of results. 

**Regularisation Parameter: (c)**

I tried three values for the c parameter (0.001, 20 and 200) with the linear kernel.

*-c 0.001  
Cross Validation Error = 62.366*
  
*-c 20   
Cross Validation Error = 62.459*

*-c 200 
Cross Validation Error = 61.978*

In SVM-rank, c trades off training error against margin size, so a smaller c means stronger regularisation. We expected that stronger regularisation would give a model with less variance that may perform better on the test dataset. However, in the above experiment, varying c has little impact on the error rate.

A different kernel, such as RBF, may fit this data better than the linear kernel, in which case we might notice a more substantial difference in the cross-validation error.
  
**Loss Function: (l)**   
All the earlier values reported in this question use the number of swapped pairs as the loss function.
If we instead use l = 2, which is the fraction of swapped pairs, we do not notice much change in the error rate. The error rates stay within ~1-2% of the earlier values.
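
To illustrate the two loss definitions, the sketch below counts swapped pairs for a single query and derives the fraction from that count. `swapped_pairs` is a hypothetical helper, the labels and scores are made up, and counting tied scores as swapped is a convention chosen for this sketch only.

```python
from itertools import combinations
from typing import Sequence, Tuple


def swapped_pairs(labels: Sequence[int], scores: Sequence[float]) -> Tuple[int, int]:
    """Return (swapped, comparable) pair counts for the documents of one query."""
    swapped = 0
    comparable = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equal relevance: no preference to violate
        comparable += 1
        # The pair is swapped when the score order disagrees with the label order.
        if (labels[i] - labels[j]) * (scores[i] - scores[j]) <= 0:
            swapped += 1
    return swapped, comparable


# Hypothetical labels and model scores for four documents of one query.
labels = [2, 1, 0, 0]
scores = [0.9, 0.2, 0.5, 0.1]
swapped, comparable = swapped_pairs(labels, scores)
print("swapped pairs:", swapped)
print("fraction of swapped pairs:", swapped / comparable)
```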

*-c 20 -l 2   
Fold 1: Zero/one-error on test set: 57.05%   
Fold 2: Zero/one-error on test set: 54.14%    
Fold 3: Zero/one-error on test set: 62.42%     
Fold 4: Zero/one-error on test set: 69.34%     
Fold 5: Zero/one-error on test set: 63.69%*
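
Averaging the per-fold errors can be done with a small parser. This is a sketch: it assumes the log lines have the `Zero/one-error on test set: NN.NN%` shape quoted above, and `zero_one_errors` is a hypothetical helper.

```python
import re
from statistics import mean
from typing import List

ERROR_PATTERN = re.compile(r"Zero/one-error on test set:\s*([0-9.]+)%")


def zero_one_errors(log_text: str) -> List[float]:
    """Extract every zero/one error percentage found in the text."""
    return [float(value) for value in ERROR_PATTERN.findall(log_text)]


# The five per-fold results reported for -c 20 -l 2.
log = """\
Fold 1: Zero/one-error on test set: 57.05%
Fold 2: Zero/one-error on test set: 54.14%
Fold 3: Zero/one-error on test set: 62.42%
Fold 4: Zero/one-error on test set: 69.34%
Fold 5: Zero/one-error on test set: 63.69%
"""
errors = zero_one_errors(log)
print("cross-validation error: %.3f%%" % mean(errors))
```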


**Algorithm: (w)**

All the values reported in this question are using 1-slack dual algorithm.
Besides this, we tried experimenting with w=3 and w=4

*-c 20 -w 3   
Cross Validation Error = 62.624*
  
  
*-c 20 -w 4  
Cross Validation Error = 63.567*

Changing the algorithm, while keeping the linear kernel, does not impact the error rate in any major way.

**Tolerance: (e)**
We expected that increasing the tolerance drastically would increase the error rate, and we observed exactly that in our experiments.

We set the tolerance to e = 2 along with c = 20. This setting increased the error rate for all folds drastically, by approximately 7-8%. 

*-c 20 -e 2  
Fold 1: Zero/one-error on test set: 67.31%  
Fold 2: Zero/one-error on test set: 66.43%  
Fold 3: Zero/one-error on test set: 71.34%  
Fold 4: Zero/one-error on test set: 77.71%  
Fold 5: Zero/one-error on test set: 76.43%*

**Kernel: (t)** 

### The upshot of this analysis is that if we are using a poor kernel, tinkering with the other parameters won't improve accuracy by much.


## Noise!

Now we're going to investigate whether the ranking SVM is easily influenced by noisy features. For example, what if some of the features are in error? Or what if we downloaded only a portion of a page to calculate a feature (so that the count of inlinks would be wrong)?

In this case, we add some noise to the features and see what happens to the results. We may choose to add random noise throughout, noise to a single feature, noise to multiple features, etc. The goal is to document which explorations we conduct and what we conclude from them.

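In [3]:
import pandas as pd
# Load the Fold 2 training split; every space-separated token becomes a column.
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
train_data2.head(5)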

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0,qid:11909,1:0.049554,2:0.000000,3:0.000000,4:0.000000,5:0.048537,6:0.000000,7:0.000000,8:0.000000,...,46:1.000000,#docid,=,GX000-01-8011551,inc,=,1.0,prob,=,0.278009
1,0,qid:11909,1:0.000000,2:0.000000,3:0.400000,4:0.166667,5:0.000000,6:0.000000,7:0.000000,8:0.000000,...,46:0.000000,#docid,=,GX012-13-11604073,inc,=,1.0,prob,=,0.081141
2,0,qid:11909,1:0.012478,2:0.800000,3:0.000000,4:1.000000,5:0.014989,6:0.000000,7:0.000000,8:0.000000,...,46:0.166667,#docid,=,GX036-18-10002856,inc,=,0.002252,prob,=,0.109118
3,1,qid:11909,1:0.049198,2:0.000000,3:0.000000,4:0.000000,5:0.048180,6:0.000000,7:0.000000,8:0.000000,...,46:0.000000,#docid,=,GX043-50-8139281,inc,=,1.0,prob,=,0.256
4,2,qid:11909,1:0.252050,2:1.000000,3:1.000000,4:0.000000,5:0.254818,6:0.000000,7:0.000000,8:0.000000,...,46:0.527778,#docid,=,GX062-53-0946803,inc,=,0.00794,prob,=,0.205853


*We add noise to the training data for Fold 2 only.*

I am using Pandas to inspect the LETOR dataset. 
One way to introduce noise is to choose one or more specific columns and repopulate each of them with random values from a pseudo-random number generator. Each random value is prefixed with the feature number and a colon, so that the "index:value" format is preserved.

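In [4]:
import numpy as np

def add_noise_single_feature(feature_number, df):
    """Replace one feature column of df, in place, with N(0, 1) noise."""
    # Column 0 is the label and column 1 the qid, so feature k is column k + 1.
    column = feature_number + 1
    size = len(df)
    noise = np.random.normal(0, 1, size)
    # Keep the LETOR "index:value" token format.
    noisy_feature = [str(feature_number) + ":" + str(i) for i in noise]
    df[column] = noisy_feature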
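
The function above replaces the feature outright with random values. A milder alternative, sketched below under the assumption that we want to perturb rather than overwrite, adds Gaussian noise to the existing value. `add_gaussian_noise`, `sigma` and `seed` are hypothetical names, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd


def add_gaussian_noise(df: pd.DataFrame, feature_number: int,
                       sigma: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with N(0, sigma^2) noise ADDED to one feature."""
    rng = np.random.default_rng(seed)
    column = feature_number + 1  # column 0 = label, column 1 = qid
    noisy = df.copy()
    # Strip the "index:" prefix, perturb the value, then restore the prefix.
    values = noisy[column].str.split(":", n=1).str[1].astype(float)
    perturbed = values + rng.normal(0.0, sigma, size=len(noisy))
    noisy[column] = [f"{feature_number}:{value:.6f}" for value in perturbed]
    return noisy


# Toy frame with one feature, laid out like the parsed LETOR file.
toy = pd.DataFrame({0: [0, 1], 1: ["qid:1", "qid:1"], 2: ["1:0.5", "1:0.25"]})
noisy = add_gaussian_noise(toy, feature_number=1, sigma=0.1)
print(noisy[2].tolist())
```

Unlike the in-place version, this sketch returns a copy, so the clean frame stays available for comparison.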

### Adding noise to the 7th feature

In [7]:
add_noise_single_feature(7, train_data2)
train_data2.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0,qid:11909,1:0.049554,2:0.000000,3:0.000000,4:0.000000,5:0.048537,6:0.000000,7:-0.0673859344042,8:0.000000,...,46:1.000000,#docid,=,GX000-01-8011551,inc,=,1.0,prob,=,0.278009
1,0,qid:11909,1:0.000000,2:0.000000,3:0.400000,4:0.166667,5:0.000000,6:0.000000,7:-1.30251764737,8:0.000000,...,46:0.000000,#docid,=,GX012-13-11604073,inc,=,1.0,prob,=,0.081141
2,0,qid:11909,1:0.012478,2:0.800000,3:0.000000,4:1.000000,5:0.014989,6:0.000000,7:1.3765957986,8:0.000000,...,46:0.166667,#docid,=,GX036-18-10002856,inc,=,0.002252,prob,=,0.109118
3,1,qid:11909,1:0.049198,2:0.000000,3:0.000000,4:0.000000,5:0.048180,6:0.000000,7:1.1916632949,8:0.000000,...,46:0.000000,#docid,=,GX043-50-8139281,inc,=,1.0,prob,=,0.256
4,2,qid:11909,1:0.252050,2:1.000000,3:1.000000,4:0.000000,5:0.254818,6:0.000000,7:0.81053364571,8:0.000000,...,46:0.527778,#docid,=,GX062-53-0946803,inc,=,0.00794,prob,=,0.205853


### Writing this noisy dataframe to a text file so it can be used as training input for svm_rank_learn

In [8]:
train_data2.to_csv("train_noisy.txt", sep= " ", header=None, index=False)
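
Before training on a rewritten file, it can be worth checking that every line still follows the expected layout: a label, a `qid:` token, then feature tokens with consecutive indices. The checker below is a minimal sketch; `is_valid_letor_line` is a hypothetical helper, and the example lines are shortened to three features.

```python
def is_valid_letor_line(line: str, n_features: int = 46) -> bool:
    """Check the label, the qid and the consecutive index:value tokens of one line."""
    tokens = line.split("#", 1)[0].split()
    if len(tokens) != n_features + 2:
        return False
    if not tokens[0].isdigit() or not tokens[1].startswith("qid:"):
        return False
    for expected_index, token in enumerate(tokens[2:], start=1):
        index, sep, value = token.partition(":")
        if sep != ":" or index != str(expected_index):
            return False
        try:
            float(value)  # the value must parse as a number
        except ValueError:
            return False
    return True


# Shortened example lines with three features each.
good = "0 qid:11909 1:0.049554 2:-0.0673859344042 3:0.000000 #docid = GX000-01-8011551"
bad = "0 qid:11909 1:0.049554 3:0.000000 2:0.5 #docid = GX000-01-8011551"
print(is_valid_letor_line(good, n_features=3), is_valid_letor_line(bad, n_features=3))
```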

With the *linear kernel* and the c parameter set to 20, we obtained the following error.

Zero/one-error on test set: 57.32% 

We repeated the above step for 5 more iterations, each time picking one feature number at random between 1 and 46 and introducing noise into that feature column. In all 5 iterations, there was no effect on the error rate.

As observed above, there is no major effect on the zero/one error if only *one feature* is noisy. 

### Let's repeat the above experiment, but this time we will increase the number of noisy features and see how the error changes as the number of noisy features increases

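In [9]:
from random import randint

def add_noise(noisy_features_count, df):
    # randint draws with replacement, so the same feature can be picked twice
    # and fewer than noisy_features_count distinct features may end up noisy.
    for i in range(noisy_features_count):  # range replaces Python 2's xrange
        feature_number = randint(1, 46)
        add_noise_single_feature(feature_number, df)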
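
Because `randint` can return the same feature number more than once, the number of distinct noisy features is usually smaller than the requested count. The sketch below estimates that effect and shows one way to draw distinct features instead. `pick_distinct_features` and `expected_distinct_with_replacement` are hypothetical helpers, not part of the experiments reported here.

```python
import random
from typing import List


def pick_distinct_features(count: int, total_features: int = 46, seed: int = 0) -> List[int]:
    """Draw `count` different feature numbers from 1..total_features."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_features + 1), count))


def expected_distinct_with_replacement(count: int, total_features: int = 46) -> float:
    """Expected number of distinct features after `count` independent uniform draws."""
    return total_features * (1 - (1 - 1 / total_features) ** count)


for count in (5, 10, 15, 25, 35, 40):
    expected = expected_distinct_with_replacement(count)
    print(f"{count} draws -> about {expected:.1f} distinct features expected")

print(pick_distinct_features(5))
```

Each number returned by `pick_distinct_features` could be passed to `add_noise_single_feature` to make exactly that many features noisy.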

**Adding noise in random 5 features:**

With the *linear kernel* and the c parameter set to 20, we obtained the following errors.

Fold2: Test data: Zero/one-error on test set: 56.69%  
Fold4: Test data: Zero/one-error on test set: 67.52%  
Fold5: Test data: Zero/one-error on test set: 63.69% 

In [10]:
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(5, train_data2)
train_data2.to_csv("train_noisy_5.txt", sep= " ", header=None, index=False)

**Adding noise in random 10 features**

With the *linear kernel* and the c parameter set to 20, we obtained the following errors.

Fold2: Test data: Zero/one-error on test set: 56.8%  
Fold4: Test data: Zero/one-error on test set: 68.15%  
Fold5: Test data: Zero/one-error on test set: 64.97% 

In [11]:
# Adding noise in random 10 features
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(10, train_data2)
train_data2.to_csv("train_noisy_10.txt", sep= " ", header=None, index=False)

**Adding noise in random 15 features**

With the *linear kernel* and the c parameter set to 20, we obtained the following errors.

Fold2: Test data: Zero/one-error on test set: 57.32%   
Fold4: Test data: Zero/one-error on test set: 68.15%  
Fold5: Test data: Zero/one-error on test set: 62.42% 

In [16]:
# Adding noise in random 15 features
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(15, train_data2)
train_data2.to_csv("train_noisy_15.txt", sep= " ", header=None, index=False)
train_data2.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0,qid:11909,1:0.049554,2:-0.395239950239,3:0.000000,4:0.000000,5:-1.58582002801,6:0.000000,7:0.249721094832,8:0.000000,...,46:1.000000,#docid,=,GX000-01-8011551,inc,=,1.0,prob,=,0.278009
1,0,qid:11909,1:0.000000,2:-1.36168514986,3:0.400000,4:0.166667,5:-0.369106732819,6:0.000000,7:0.861912354372,8:0.000000,...,46:0.000000,#docid,=,GX012-13-11604073,inc,=,1.0,prob,=,0.081141
2,0,qid:11909,1:0.012478,2:-0.907681125372,3:0.000000,4:1.000000,5:-0.976710216473,6:0.000000,7:0.279797735237,8:0.000000,...,46:0.166667,#docid,=,GX036-18-10002856,inc,=,0.002252,prob,=,0.109118
3,1,qid:11909,1:0.049198,2:-1.04242486225,3:0.000000,4:0.000000,5:0.999196407635,6:0.000000,7:-1.3974802177,8:0.000000,...,46:0.000000,#docid,=,GX043-50-8139281,inc,=,1.0,prob,=,0.256
4,2,qid:11909,1:0.252050,2:-0.431940393175,3:1.000000,4:0.000000,5:1.1356441171,6:0.000000,7:0.00352635031363,8:0.000000,...,46:0.527778,#docid,=,GX062-53-0946803,inc,=,0.00794,prob,=,0.205853


**Adding noise in random 25 features**

With the *linear kernel* and the c parameter set to 20, we obtained the following errors.

Fold2: Test data: Zero/one-error on test set: 56.8%  
Fold4: Test data: Zero/one-error on test set: 67.52%  
Fold5: Test data: Zero/one-error on test set: 64.33% 

In [13]:
# Adding noise in random 25 features
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(25, train_data2)
train_data2.to_csv("train_noisy_25.txt", sep= " ", header=None, index=False)

**Adding noise in random 35 features**

Fold2: Test data: Zero/one-error on test set: 56.8%  
Fold4: Test data: Zero/one-error on test set: 69.05%  
Fold5: Test data: Zero/one-error on test set: 68.95% 

In [14]:
# Adding noise in random 35 features
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(35, train_data2)
train_data2.to_csv("train_noisy_35.txt", sep= " ", header=None, index=False)

**Adding noise in random 40 features**

Fold2: Test data: Zero/one-error on test set: 54.08%  
Fold4: Test data: Zero/one-error on test set: 69.06%  
Fold5: Test data: Zero/one-error on test set: 68.32% 

In [17]:
# Adding noise in random 40 features
train_data2 = pd.read_csv('MQ2008/Fold2/train.txt', sep = " ", header = None)
add_noise(40, train_data2)
train_data2.to_csv("train_noisy_40.txt", sep= " ", header=None, index=False)
train_data2.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0,qid:11909,1:0.049554,2:1.3683923423,3:0.000000,4:-0.0224678092047,5:0.048537,6:-1.23246540084,7:1.07740580979,8:0.000000,...,46:-2.00125664356,#docid,=,GX000-01-8011551,inc,=,1.0,prob,=,0.278009
1,0,qid:11909,1:0.000000,2:1.89833159636,3:0.400000,4:-0.058728743646,5:0.000000,6:-0.746285719057,7:-0.462650867631,8:0.000000,...,46:0.199754961202,#docid,=,GX012-13-11604073,inc,=,1.0,prob,=,0.081141
2,0,qid:11909,1:0.012478,2:0.231658124262,3:0.000000,4:-0.127336691611,5:0.014989,6:-0.724090964036,7:1.28917989999,8:0.000000,...,46:0.739613626626,#docid,=,GX036-18-10002856,inc,=,0.002252,prob,=,0.109118
3,1,qid:11909,1:0.049198,2:0.866348873695,3:0.000000,4:-0.306919429092,5:0.048180,6:-0.0725845908323,7:0.247519059051,8:0.000000,...,46:-1.00558574207,#docid,=,GX043-50-8139281,inc,=,1.0,prob,=,0.256
4,2,qid:11909,1:0.252050,2:-1.58582314678,3:1.000000,4:0.233425842902,5:0.254818,6:-0.0654008660494,7:-0.337632516257,8:0.000000,...,46:0.982044021048,#docid,=,GX062-53-0946803,inc,=,0.00794,prob,=,0.205853


### Analysis of adding noise to the dataset:

All the models were run using the linear kernel with c (the regularization parameter) set to 20.0.
Noise was added to Fold 2's training data.

So, the base error for comparing our noisy dataset performance is the respective fold's test error.
The above results show that as the number of noisy features increases, the error does not increase drastically. It stays within roughly 5% of the existing error.
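
To compare the runs side by side, the reported errors can be collected in one structure and summarised per fold. This is a sketch: the numbers are copied from the results above, and `error_range_per_fold` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Zero/one test errors (%) copied from the runs reported above,
# keyed by the number of noisy-feature draws.
noise_results = {
    5:  {"Fold2": 56.69, "Fold4": 67.52, "Fold5": 63.69},
    10: {"Fold2": 56.80, "Fold4": 68.15, "Fold5": 64.97},
    15: {"Fold2": 57.32, "Fold4": 68.15, "Fold5": 62.42},
    25: {"Fold2": 56.80, "Fold4": 67.52, "Fold5": 64.33},
    35: {"Fold2": 56.80, "Fold4": 69.05, "Fold5": 68.95},
    40: {"Fold2": 54.08, "Fold4": 69.06, "Fold5": 68.32},
}


def error_range_per_fold(results: Dict[int, Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return the (lowest, highest) error seen for each fold across all noise levels."""
    ranges: Dict[str, Tuple[float, float]] = {}
    for errors in results.values():
        for fold, error in errors.items():
            low, high = ranges.get(fold, (error, error))
            ranges[fold] = (min(low, error), max(high, error))
    return ranges


for fold, (low, high) in sorted(error_range_per_fold(noise_results).items()):
    print(f"{fold}: min={low:.2f}% max={high:.2f}% spread={high - low:.2f} points")
```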

This is unexpected: if 40 of the 46 features were noisy, the error should have increased drastically. Note, however, that add_noise picks features with randint, so the same feature can be drawn more than once and fewer than 40 distinct features are actually replaced. In the 40-feature output above, for example, features 1, 3, 5 and 8 still hold their original values.

One of my guesses for why this is the case is that our base error (against which we compare the noisy runs) is already very poor. More than 55% of the predictions made by the base model are incorrect.

Therefore, adding noise to an already bad model does not make much difference to the error.

Another conclusion we may draw from this experiment is that the Ranking SVM is quite robust to noise. 

Had we used an RBF kernel, which we expect to perform better than the linear kernel, we might have seen the expected increase in the error rate as the number of noisy features grows. However, due to limited computing resources and the long training time of RBF models, we ran the noise experiments on the linear model only.

