### Guideline

In this assignment, you will be implementing two clustering validation measures: Normalized Mutual Information (NMI) and Jaccard similarity.
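
For reference, the two measures can be written out as follows. This is a sketch under assumptions: the assignment text does not say which NMI normalization or which Jaccard variant is intended, so the geometric-mean normalization and the pair-counting Jaccard coefficient are assumed here. The symbols $C$ (test clustering), $T$ (ground truth), $n_{ij}$, and $n$ are introduced only for this illustration.

$$
\begin{aligned}
p_{ij} &= \frac{n_{ij}}{n}, \qquad p_{C_i} = \frac{|C_i|}{n}, \qquad p_{T_j} = \frac{|T_j|}{n} \\
I(C;T) &= \sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{p_{C_i}\, p_{T_j}} \\
H(C) &= -\sum_{i} p_{C_i} \log p_{C_i}, \qquad H(T) = -\sum_{j} p_{T_j} \log p_{T_j} \\
\mathrm{NMI}(C,T) &= \frac{I(C;T)}{\sqrt{H(C)\,H(T)}} \\
\mathrm{Jaccard}(C,T) &= \frac{TP}{TP + FN + FP}
\end{aligned}
$$

Here $n_{ij}$ is the number of items that fall in cluster $C_i$ and ground-truth partition $T_j$, and $n$ is the total number of items. $TP$ counts item pairs placed together in both $C$ and $T$, $FN$ counts pairs together in $T$ but separated in $C$, and $FP$ counts pairs together in $C$ but separated in $T$.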

You will be given one ground-truth clustering (partition) and five clustering test cases. Evaluate each test case against the ground truth using the NMI and Jaccard measures, and submit the resulting scores. You will be graded on whether your scores are correct.

Each clustering result (both the ground truth and the test cases) is stored in its own file. Each line in a file consists of two integers separated by a space. The first integer is the id of a data item, and the second integer is the id of the cluster to which that item belongs.
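
The file format described above can be parsed with a few lines of plain Python. The following is a minimal sketch, not part of the original notebook: the helper name `read_clustering` and the sample contents are hypothetical and serve only to illustrate the "item id, cluster id" layout.

```python
import io


def read_clustering(stream):
    """Parse lines of '<item_id> <cluster_id>' into a dict item_id -> cluster_id."""
    assignment = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item_id, cluster_id = map(int, line.split())
        assignment[item_id] = cluster_id
    return assignment


# Hypothetical file contents, for illustration only.
sample = io.StringIO("0 1\n1 1\n2 0\n")
parsed = read_clustering(sample)
print(parsed)  # {0: 1, 1: 1, 2: 0}
```

Keying the result by item id makes it possible to align two clusterings even if their files list the items in different orders.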

You need to submit a file titled "scores.txt" consisting of 5 lines. Each line contains two floating-point numbers separated by a space. On the i-th line, the first number is the NMI you calculated for the i-th test case (i.e. "clustering_i.txt") with regard to the ground truth given in "partitions.txt", and the second number is the Jaccard measure you calculated for the same test case.
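
As an illustration of the expected layout of "scores.txt", here is a minimal sketch that formats one line per test case. The helper name `format_scores`, the six-decimal precision, and the input values are all assumptions for the example; the assignment text does not specify a precision.

```python
def format_scores(nmi_scores, jaccard_scores):
    """Return the file body: one 'NMI Jaccard' line per test case."""
    lines = [
        "{:.6f} {:.6f}".format(nmi, jac)
        for nmi, jac in zip(nmi_scores, jaccard_scores)
    ]
    return "\n".join(lines) + "\n"


# Placeholder values, not real results.
example = format_scores([0.5, 1.0], [0.25, 1.0])
print(example, end="")
```

Writing the returned string to "scores.txt" with `open('scores.txt', 'w')` would produce a file in the requested format.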

In [31]:
import pandas as pd
import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score
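# jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23 (older releases only).
from sklearn.metrics import jaccard_similarity_score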

In [32]:
# Read the ground truth and the five test clusterings; each line is "<item id> <cluster id>".
truth = pd.read_csv('data/partitions.txt', sep=' ', names=['id', 'label'], index_col=['id'])
clusters = []
for i in range(1, 6):
    filename = 'data/clustering_{}.txt'.format(i)
    clusters.append(pd.read_csv(filename, sep=' ', names=['id', 'label'], index_col=['id']))

In [40]:
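# NMI is symmetric and ignores how clusters are numbered; rows are compared positionally.
NMI_score = [normalized_mutual_info_score(cluster["label"], truth["label"]) for cluster in clusters]
# On 1-D labels this equals plain accuracy (exact label-id matches), so it depends on cluster numbering.
Jaccard_score = [jaccard_similarity_score(truth["label"], cluster["label"]) for cluster in clusters]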
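
The `jaccard_similarity_score` call above compares label ids position by position, whereas clustering validation usually defines Jaccard over pairs of items, which does not depend on how clusters are numbered. The following is a minimal sketch of that pair-counting variant, together with a geometric-mean NMI for comparison. It assumes these are the intended definitions; the function names are hypothetical, and the handling of degenerate cases (zero denominators) is a convention chosen for the example.

```python
import math
from collections import Counter


def _pairs(count):
    """Number of unordered pairs that can be formed from `count` items."""
    return count * (count - 1) // 2


def pair_counting_jaccard(truth_labels, pred_labels):
    """Jaccard = TP / (TP + FN + FP), counted over pairs of items."""
    truth_labels, pred_labels = list(truth_labels), list(pred_labels)
    joint = Counter(zip(truth_labels, pred_labels))
    tp = sum(_pairs(c) for c in joint.values())
    same_truth = sum(_pairs(c) for c in Counter(truth_labels).values())  # TP + FN
    same_pred = sum(_pairs(c) for c in Counter(pred_labels).values())  # TP + FP
    denom = same_truth + same_pred - tp  # TP + FN + FP
    # Assumed convention: if no pair is ever grouped together, return 1.0.
    return tp / denom if denom else 1.0


def nmi_geometric(truth_labels, pred_labels):
    """NMI = I(C;T) / sqrt(H(C) * H(T)), using natural logarithms."""
    truth_labels, pred_labels = list(truth_labels), list(pred_labels)
    n = len(truth_labels)
    joint = Counter(zip(truth_labels, pred_labels))
    t_sizes, p_sizes = Counter(truth_labels), Counter(pred_labels)
    mi = sum(
        (c / n) * math.log(c * n / (t_sizes[a] * p_sizes[b]))
        for (a, b), c in joint.items()
    )
    h_t = -sum((c / n) * math.log(c / n) for c in t_sizes.values())
    h_p = -sum((c / n) * math.log(c / n) for c in p_sizes.values())
    if h_t == 0 and h_p == 0:
        return 1.0  # assumed convention: both clusterings are a single cluster
    denom = math.sqrt(h_t * h_p)
    return mi / denom if denom else 0.0


# Same partition with renumbered clusters: both measures report a perfect match.
jac_same = pair_counting_jaccard([0, 0, 1, 1], [1, 1, 0, 0])
nmi_same = nmi_geometric([0, 0, 1, 1], [1, 1, 0, 0])
# Partially overlapping clusterings.
jac_partial = pair_counting_jaccard([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Both functions accept any two equal-length sequences of labels, so they could be called as `pair_counting_jaccard(truth["label"], cluster["label"])` on the data frames loaded earlier, provided the rows are in the same item order.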

In [45]:
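# One row per test case (NMI, then Jaccard), written space-separated with no header or index.
scores = pd.DataFrame({'NMI': NMI_score, 'Jaccard': Jaccard_score}, columns=['NMI', 'Jaccard'])
scores.to_csv('scores.txt', sep=' ', header=False, index=False)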
scores

        NMI   Jaccard
0  0.889625  0.911689
1  0.645637  0.679484
2  0.391544  0.464930
3  0.767789  0.800598
4  0.761170  0.597586
