# K-means clusteriing on Hadoop with MRJob

## Preprocess log file data


This dataset captures which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in Feburary 1998.


#### Data Format
<PRE>
The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter which tells the line's type. The three line types of interest are:
-- Attribute lines:
For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
Where:
  'A' marks this as an attribute line, 
  '1277' is the attribute ID number for an area of the website (called a Vroot),
  '1' may be ignored, 
  '"NetShow for PowerPoint"' is the title of the Vroot, 
  '"/stream"' is the URL relative to "http://www.microsoft.com"

Case and Vote Lines:
For each user, there is a case line followed by zero or more vote lines.
For example:
  C,"10164",10164
  V,1123,1
  V,1009,1
  V,1052,1
Where:
  'C' marks this as a case line, 
  '10164' is the case ID number of a user, 
  'V' marks the vote lines for this case, 
  '1123', 1009', 1052' are the attributes ID's of Vroots that a user visited. 
  '1' may be ignored.
</PRE>
---
 Here, you must transform/preprocess the data on a single node (i.e., not on a cluster of nodes) from the following format:

- C,"10001",10001   #Visitor id 10001
- V,1000,1          #Visit by Visitor 10001 to page id 1000
- V,1001,1          #Visit by Visitor 10001 to page id 1001
- V,1002,1          #Visit by Visitor 10001 to page id 1002
- C,"10002",10002   #Visitor id 10001
- V
- Note: #denotes comments


to the following format (V, PageID, 1, C, Visitor):

- V,1000,1,C, 10001
- V,1001,1,C, 10001
- V,1002,1,C, 10001



In [1]:

from csv import reader, writer

# set output
outfile = open('anonymous-msweb-preprocessed.data', 'w')
webdata_preprocessed = writer(outfile)

# read in data and preprocess
with open('/home/cloudera/W261/week4/anonymous-msweb.data', 'r') as infile:
    webdata = reader(infile)
    for row in webdata:
        # If Case line
        if row[0]=='C':
            # save case id
            Visitor = row[1]
        #  if Vote line
        elif row[0]=='V':
            # save page id
            PageID = row[1]
            # output in the format of (V, PageID, 1, C, Visitor)
            webdata_preprocessed.writerow(['V',PageID,1,'C',Visitor])
        # do not print if any other line type
        else:
            pass


In [2]:
!head -10 anonymous-msweb-preprocessed.data
!wc -l anonymous-msweb-preprocessed.data

V,1000,1,C,10001
V,1001,1,C,10001
V,1002,1,C,10001
V,1001,1,C,10002
V,1003,1,C,10002
V,1001,1,C,10003
V,1003,1,C,10003
V,1004,1,C,10003
V,1005,1,C,10004
V,1006,1,C,10005
98531 anonymous-msweb-preprocessed.data


Save the attributes also

In [104]:
from csv import reader, writer

# set output (write in binary mode otherwise, I can't fathom why, the write is not complete but stops at page id 1036)
outfile = open('anonymous-msweb-attributes.data', 'wb')
webdata_preprocessed = writer(outfile)

# read in data and preprocess
with open('/home/cloudera/W261/week4/anonymous-msweb.data', 'r') as infile:
    webdata = reader(infile)
    for row in webdata:
        # If Case line
        if row[0]=='A':
            webdata_preprocessed.writerow(row)
        # do not print if any other line type
        else:
            pass


In [376]:
!head -10 anonymous-msweb-attributes.data

A,1287,1,International AutoRoute,/autoroute
A,1288,1,library,/library
A,1289,1,Master Chef Product Information,/masterchef
A,1297,1,Central America,/centroam
A,1215,1,For Developers Only Info,/developer
A,1279,1,Multimedia Golf,/msgolf
A,1239,1,Microsoft Consulting,/msconsult
A,1282,1,home,/home
A,1251,1,Reference Support,/referencesupport
A,1121,1,Microsoft Magazine,/magazine


## EDA: Find the most frequent pages



In [303]:
%%writefile MostFrequentVisits.py
#!/usr/bin/env python

from mrjob.job import MRJob 
from mrjob.step import MRStep
from mrjob.protocol import RawProtocol

class MRSortedVisits(MRJob):
    
    # ----------------------------- options -------------------------- #
    # change protocols for sorting
#    MRJob.SORT_VALUES = True  
    OUTPUT_PROTOCOL = RawProtocol
    INTERNAL_PROTOCOL = RawProtocol

    # -------------------- step definitions ------------------------- #
    
    def mapper_visit_emitter(self,_,line):
        '''Emits page id and a count of 1 for each line in the data set'''
        tokens= line.split(',')
        yield tokens[1],'1'
        
    
    def reducer_visit_counter(self,page,counts):
        '''Sums up page visits for each page id'''
        #the internal RawProtocol requires to loop over the values (counts) for a given key (page)
        count_page = 0
        for count in counts:
            count_page += int(count)
        #RawProtocol expects strings
        yield str(count_page),str(page)
    
    def reducer_top_5_counter(self):
        '''Keeps track of reducer calls which coincides with highest freq. page visits 
           if the input array is reverse sorted'''
        self.top_n = 0    
    
    def identity_reducer(self,count,page):
        '''Emits the first 5 records'''
        self.top_n += 1
        # page is read as a generator, exhaust it
        if self.top_n <= 5:
            for p in page:
                yield str(count),p
        else:
            pass
        
    def steps(self): return [
        # Step 1: emit visit frequencies
        MRStep(mapper=self.mapper_visit_emitter,
               reducer=self.reducer_visit_counter,
              ),
        # Step 2: sort in reverse order by visit frequency    
        MRStep(reducer_init = self.reducer_top_5_counter,
                reducer=self.identity_reducer,
               jobconf={
                      'stream.num.map.output.key.fields':'2',
                      'stream.map.output.field.separator':'\t', 
                      'mapreduce.job.output.key.comparator.class':'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                      'mapreduce.partition.keycomparator.options': '-k1,1nr'})
        ]
    
# ---------------------------------------------- run ------------------------------------------ #
    
if __name__ == '__main__': 
    MRSortedVisits.run()


Overwriting MostFrequentVisits.py


In [151]:
!chmod +x MostFrequentVisits.py

In [304]:
!echo 'Count    PageID'
!echo '---------------'
!python MostFrequentVisits.py anonymous-msweb-preprocessed.data -r hadoop -q

Count    PageID
---------------
10816	1008	
9370	1034	
8451	1004	
5325	1018	
5104	1017	


## Find the most frequent visitor



In [61]:
%%writefile mostFrequentVisitors.py
#!/usr/bin/env python



from mrjob.job import MRJob 
from mrjob.step import MRStep
#from mrjob.protocol import RawProtocol
from collections import defaultdict

class MRmostFrequentVisitors(MRJob):
    

    # -------------------- step definitions ------------------------- #
    def mapper_visitor_emitter(self,_,line):
        tokens= line.split(',')
        yield tokens[1],[tokens[-1],1]
        
    def reducer_visitor_dictionary(self):
        # save {page:{visitor:count}} in a nested dictionary with a 0 default count
        self.visitors = defaultdict(lambda: defaultdict(int))
    
    def reducer_visitor_counter(self,page,visitor_count):
        for visitor,count in visitor_count:
#            page_visitor_key = page+':'+visitor
            self.visitors[page][visitor] += int(count)
        yield page,self.visitors
    

    def reducer_max_freq(self,page,visitor_counts):
        for visitor_count in visitor_counts:
            max_key = max(visitor_count[page], key=lambda key: visitor_count[page][key])
            yield page,max_key
        
    def steps(self): return [
        # Step 1: emit visit frequencies
        MRStep(mapper=self.mapper_visitor_emitter,
               reducer_init=self.reducer_visitor_dictionary,
               reducer=self.reducer_visitor_counter#,
              ),
        # Step 2: emit highest frequency visitor (emit only one in case of tie)
        MRStep(reducer=self.reducer_max_freq)
        ]
    
# ---------------------------------------------- run ------------------------------------------ #
    
if __name__ == '__main__': 
    MRmostFrequentVisitors.run()


Overwriting mostFrequentVisitors.py


In [48]:
!chmod +x mostFrequentVisitors.py

In [121]:
!python mostFrequentVisitors.py anonymous-msweb-preprocessed.data -r local > mostFrequentVisitors.tsv

No configs found; falling back on auto-configuration
Creating temp directory /tmp/mostFrequentVisitors.root.20170205.131429.434179
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/mostFrequentVisitors.root.20170205.131429.434179/output...
Removing temp directory /tmp/mostFrequentVisitors.root.20170205.131429.434179...


Merge and print the results with the attributes file which contains the urls.

In [120]:
import re
# initialize a dictionary for easy merging based on keys
attributes = {}
# open both files that should be merged

# ------------------------------- read in data -------------------------------------------- #
with open('anonymous-msweb-attributes.data','r') as attribute_file, open('mostFrequentVisitors.tsv','r') as top_freq_file:
    # read in urls and save them in a dictionary with page_id as an index
    for row in attribute_file:
        row = row.split(',')
        page_id,url = row[1],row[-1]
        attributes[page_id] = url
    # read in page_id and visitor_id
    for n,row in enumerate(top_freq_file):
        page_id,visitor_id = re.findall(r'[0-9]+',row)
        
# ------------------------------- print first 10 lines --------------------------------- #
        if n == 0:
            print 'page_id\tvisitor_id\turl'
            print '-'*30
        print '{0}\t{1}\t{2}'.format(page_id,visitor_id,attributes[page_id])
        if n == 10:
            break

page_id	visitor_id	url
------------------------------
1000	36585	/regwiz

1001	23995	/support

1002	35235	/athome

1003	35546	/kb

1004	35540	/search

1005	10004	/norge

1006	27495	/misc

1007	19492	/ie_intl

1008	35236	/msdownload

1009	23995	/windows

1010	20915	/vbasic



##  Clustering Tweet Dataset

Clustering a dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

0: Human, where only basic human-human communication is observed.

1: Cyborg, where language is primarily borrowed from other sources
(e.g., jobs listings, classifieds postings, advertisements, etc...).

2: Robot, where language is formulaically derived from unrelated sources
(e.g., weather/seismology, police/fire event logs, etc...).

3: Spammer, where language is replicated to high multiplicity
(e.g., celebrity obsessions, personal promotion, etc... )

Check out the preprints of  recent research,
which spawned this dataset:

* http://arxiv.org/abs/1505.04342
* http://arxiv.org/abs/1508.01843

The main data lie in the accompanying file:

* [topUsers_Apr-Jul_2014_1000-words.txt](https://www.dropbox.com/s/6129k2urvbvobkr/topUsers_Apr-Jul_2014_1000-words.txt?dl=0)

and are of the form:

USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...
.
.

where

USERID = unique user identifier
CODE = 0/1/2/3 class code
TOTAL = sum of the word counts

Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.

Note that each "point" is a user as represented by 1000 words, and that
word-frequency distributions are generally heavy-tailed power-laws
(often called Zipf distributions), and are very rare in the larger class
of discrete, random distributions. For each user you will have to normalize
by its "TOTAL" column. __Try several parameterizations and initializations__ :

* (A) K=4 uniform random centroid-distributions over the 1000 words (generate 1000 random numbers and normalize the vectors)
* (B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
* (C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
* (D) K=4 "trained" centroids, determined by the sums across the classes. Use use the 
(row-normalized) class-level aggregates as 'trained' starting centroids (i.e., the training is already done for you!).
Note that you do not have to compute the aggregated distribution or the 
class-aggregated distributions, which are rows in the auxiliary file:


* [topUsers_Apr-Jul_2014_1000-words_summaries.txt](https://www.dropbox.com/s/w4oklbsoqefou3b/topUsers_Apr-Jul_2014_1000-words_summaries.txt?dl=0)

Row 1: Words
Row 2: Aggregated distribution across all classes
Row 3-6 class-aggregated distributions for clases 0-3
For (A),  we select 4 users randomly from a uniform distribution [1,...,1,000]
For (B), (C), and (D)  you will have to use data from the auxiliary file: 

* [topUsers_Apr-Jul_2014_1000-words_summaries.txt](https://www.dropbox.com/s/w4oklbsoqefou3b/topUsers_Apr-Jul_2014_1000-words_summaries.txt?dl=0)

This file contains 5 special word-frequency distributions:

* (1) The 1000-user-wide aggregate, which you will perturb for initializations
in parts (B) and (C), and
* (2-5) The 4 class-level aggregates for each of the user-type classes (0/1/2/3)


In parts (B) and (C), you will have to perturb the 1000-user aggregate 
(after initially normalizing by its sum, which is also provided).
So if in (B) you want to create 2 perturbations of the aggregate, start
with (1), normalize, and generate 1000 random numbers uniformly 
from the unit interval (0,1) twice (for two centroids), using:

In [None]:
from numpy import random
numbers = random.sample(1000)

Take these 1000 numbers and add them (component-wise) to the 1000-user aggregate,
and then renormalize to obtain one of your aggregate-perturbed initial centroids.

In [None]:
###################################################################################
##Geneate random initial centroids around the global aggregate
##Part (B) and (C) of this question
###################################################################################
def startCentroidsBC(k):
    counter = 0
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        if counter == 2:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    #perturb the global aggregate for the four initializations    
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    return centroids

For experiments A, B, C and D and iterate until a threshold (try 0.001) is reached.
After convergence, print out a summary of the classes present in each cluster.
In particular, report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

<h2>K-Means</h2>
K-means is a clustering method that aims to find the positions μi,i=1...k of the clusters that minimize the distance from the data points to the cluster. K-means clustering solves:
<br><br>
$$\arg\min_{c} \sum_{i=1}^k\sum_{{x}\in c_i} d({x},\mu_i) = \arg\min_{c} \sum_{i=1}^k\sum_{{x}\in c_i} \left\Vert {x}-\mu_i \right\Vert_2^2$$
<br><br>
where ${c}_i$ is the set of points that belong to cluster i. The K-means clustering uses the square of the Euclidean distance $d({x},\mu_i) = \left\Vert {x}-\mu_i \right\Vert_2^2$. This problem is not trivial (in fact it is NP-hard), so the K-means algorithm only hopes to find the global minimum, possibly getting stuck in a different solution.

<h2>K-means algorithm</h2>

The Lloyd's algorithm, mostly known as k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters k. Then:

<table>
<tbody><tr><td>1. Initialize the center of the clusters</td>
<td>${\mu}_i = $ some value $, i=1,...,k$</td>
</tr>
<tr>
<td>2. Attribute the closest cluster to each data point</td>
<td>${c}_i = \{j: d({x}_j, \mu_i) \le d({x}_j, \mu_l),  l \ne i, j=1,...,n\}$ </td>
</tr>
<tr>
<td>3. Set the position of each cluster to the mean of all data points belonging to that cluster</td>
<td>$\mu_i = \frac{1}{|c_i|}\sum_{j\in c_i} {x}_j,\forall i$</td>
</tr>
<tr><td>4. Repeat steps 2-3 until convergence</td>
<td></td>
</tr>
<tr><td>Notation</td><td>${|c|} = $ number of elements in  ${c}$</td>
</tr>
</tbody>
</table>

<h2>Calculating purity</h2>
![purity illustration](http://www.candpgeneration.com/images/purity.png)

Normalize the data and save it in 'Kmeansdata.csv'

In [57]:
with open('topUsers_Apr-Jul_2014_1000-words.txt', 'r') as infile, open('Kmeandata.csv', 'w') as outfile:
    for line in infile:
        line = line.split(',')
        USERID,CODE,TOTAL,word_freqs = line[0],line[1],line[2],line[3:] 
        # keep classification information for counting
        out = list(CODE)
        # save classification information, normalize by dividing by total word counts
        [out.append(float(word_freq)/float(TOTAL)) for word_freq in word_freqs]
        outfile.write(','.join(str(j) for j in out)+'\n')

In [2]:
%%writefile Kmeans.py
#!/usr/bin/env python

import numpy as np
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain
import os
import sys
import re

def normalize(x):
    return x/np.sum(x)

def add_vectors(x,y):
    return map(sum,zip(x,y))

def purity(freq_table):
    majority = np.max(freq_table,axis=1)
    purity = float(np.sum(majority))/np.sum(freq_table)
    return purity

def uniform_centroid_generator(k,size):
    '''Generate k x size dimensional uniformly distributed, normalized numbers.'''
    
    centroid_points = []
    # generate normalized uniform random numbers
    for i in range(k):
        norm_unif_nums = normalize(np.random.uniform(size=size))
        centroid_points.append(norm_unif_nums)
     
    # ----------------- output    
    # save the centroid_points locally as a k x size dimensional file
    with open('Centroids.txt','w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

    return centroid_points

        
def perturbation_centroid_generator(k,size):
    counter = 0
    centroid_points=[]
    
    # Read in the user-wide aggregate, then normalize
    with open('topUsers_Apr-Jul_2014_1000-words_summaries.txt','r') as infile:
        for line in infile:
            line = line.split(',')
            if counter == 1:
                globalAggregate = [int(word_sum)/float(line[2]) for word_sum in line[3:]]
            counter += 1
        # Perturb the globalAggregate for each initialization
    for i in range(k):
        rndpoints = random.sample(size)
        pert_centroid_points = add_vectors(globalAggregate,rndpoints/10)
        norm_pert_centroid_points = normalize(pert_centroid_points)
        centroid_points.append(norm_pert_centroid_points)
    # ----------------- output    
    with open('Centroids.txt', 'w+') as f:
         f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
            
    return centroid_points
    

def trained_centroid_generator(k):
    
    centroid_points=[]
    
    # Read in the user-wide aggregate, then normalize
    with open('topUsers_Apr-Jul_2014_1000-words_summaries.txt','r') as infile:
        for line in infile:
            line = line.split(',')
            if line[0]=='CODE':
                aggregate = [int(word_freq)/float(line[2]) for word_freq in line[3:]]
                centroid_points.append(aggregate)
  
    with open('Centroids.txt', 'w+') as f:
         f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
            
    return centroid_points
    
def centroid_generator(centroid_type,k,size=1000):
    
    if centroid_type == 'uniform':
        return uniform_centroid_generator(k,size)
        
    elif centroid_type == 'perturbation':
        return perturbation_centroid_generator(k,size)
        
    elif centroid_type == 'trained':
        return trained_centroid_generator(k)
        
    else:
        raise ValueError("Unknown centroid type. Choose 'uniform', 'perturbation' or 'trained'.")

#Calculate find the nearest centroid for data point 
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

#Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

class MRKmeans(MRJob):

    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   combiner_init = self.combiner_init,
                   combiner = self.combiner,
                   reducer_init = self.reducer_init,
                   reducer=self.reducer)
               ]
    #load centroids info from file
    def mapper_init(self):        
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) for s in open("Centroids.txt").readlines()]
        #open('Centroids.txt', 'w').close()

    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        D = (map(float,line.split(',')))
        # first element is the classification, to provide a counter, convert it to a vector of length 4
        # with [#Human,#Cyborg,#Robot,#Spammer]
        class_freqs = [0]*4
        class_freqs[int(D[0])] = 1
        # normalized frequencies start from the second element
        norm_freqs = D[1:]
        # Output:
        yield int(MinDist(norm_freqs,self.centroid_points)), (norm_freqs,class_freqs,1)
        
    def combiner_init(self):
        self.n = 1000
        
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        norm_freqs_sum = [0]*self.n
        class_freqs_sum = [0]*4
        count = 0
        
        for norm_freqs, class_freqs, n in inputdata:
            count += n
            class_freqs_sum = add_vectors(class_freqs_sum,class_freqs)
            norm_freqs_sum = add_vectors(norm_freqs_sum,norm_freqs)
        
        yield idx,(norm_freqs_sum,class_freqs_sum,count)
        
    def reducer_init(self):
        self.n = 1000
        
    #Aggregate sum for each cluster and then calculate the new centroids
    def reducer(self, idx, inputdata): 
        centroids = []
        count = 0 
        norm_freqs = [0]*self.n
        class_freqs = [0]*4
        for norm_freqs_sum, class_freqs_sum, n in inputdata:
            count += n
            class_freqs = add_vectors(class_freqs_sum,class_freqs)
            norm_freqs =  add_vectors(norm_freqs_sum,norm_freqs)
        
        # the new centroids are the means of the sums of all the member vector elements
        centroids = [x/count for x in norm_freqs_sum]

        yield idx,(centroids,class_freqs)
    
      
if __name__ == '__main__':
    MRKmeans.run()

Overwriting Kmeans.py


In [93]:
%%writefile kmeans_runner.py
#!/usr/bin/env python

from __future__ import print_function
import sys
import numpy as np
from numpy import random
from Kmeans import MRKmeans, stop_criterion, centroid_generator, purity

#mr_job = MRKmeans(args=['Kmeandata.csv', '--file=Centroids.txt','--cmdenv, 'k=4'])
mr_job = MRKmeans(args=['Kmeandata.csv', '--file=Centroids.txt'])


# ------------------------------ command line arguments ------------------#
k = int(sys.argv[1])
centroid_type =  sys.argv[2]
# bootstrapping is an embarissingly parallel problem but solve it now serially
bootstrap_rep = int(sys.argv[3])



# ----------------------------- initialize ------------------------------- #


# set up containers

iterations = []
purities = []

for n_boot in range(bootstrap_rep):
    # set boostrap iteration counter
    print('\rBootstrap replications:',n_boot+1,'/',bootstrap_rep,end='')
    sys.stdout.flush()
    
    #Geneate initial centroids
    centroid_points = centroid_generator(centroid_type,k)

    # Update centroids iteratively
    i = 0
    class_freqs=[[0]*4]*k
    while(1):
        # save previous centoids to check convergency
        centroid_points_old = list(centroid_points)
    
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for line in runner.stream_output():
                key,value =  mr_job.parse_output_line(line)
                centroid_points[key] = value[0]
                class_freqs[key] = value[1]   
                    
            # Update the centroids for the next iteration
            with open('Centroids.txt', 'w') as f:
                f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

        i = i + 1
        if(stop_criterion(centroid_points_old,centroid_points,0.01)):
            break
    # save purity and iteration counts for the bootstrap replicates        
    purities.append(purity(class_freqs))
    iterations.append(i)

# ------------------------------------- report results -------------------------- #

#  ------------------ info line
print('\nCluster type: {0}\n'.format(sys.argv[2]))
print('Number of clusters: {0}\n'.format(sys.argv[1]))



# ------------------ table 

print('\nProportions of classes in clusters are based on the last bootstrap replicate.\n')

# proportions of classes and clusters
obs_class = [0]*4
with open('topUsers_Apr-Jul_2014_1000-words.txt') as infile:
    for line in infile:
        line = line.split(',')
        obs_class[int(line[1])] += 1

class_template = '{0:>7} | {1:5} | {2:6} | {3:5} | {4:7} | {5:^4}'       
prop_template = '{0:7} | {1:5.2f} | {2:6.2f} | {3:5.2f} | {4:7.2f} | {5:4}'     

# header
print(class_template.format('Cluster','Human','Cyborg','Robot','Spammer','N'))
print('-'*49)
# values
for j in range(len(class_freqs)):
    # cluster membership distribution
    col_sums = np.sum(class_freqs,axis=1)
    # class predictions
    pred_class = list(class_freqs[j])
    props = [float(x)/y for x,y  in zip(pred_class,obs_class)]
    print(prop_template.format(j,props[0],props[1],props[2],props[3],col_sums[j]))
# footer
print('-'*49)
row_sums = np.sum(class_freqs,axis=0)
total = np.sum(class_freqs)
print(class_template.format('N',row_sums[0],row_sums[1],row_sums[2],row_sums[3],total))

# ----------------------- Statistics

# purity

mean_purity = np.mean(purities)
purity_q05 = np.round(np.percentile(purities,5),2)
purity_q95 = np.round(np.percentile(purities,95),2)
print('\nMean purity (95% conf. int.): {0} ({1}-{2})'.format(mean_purity,purity_q05,purity_q95))
# iterations
median_iteration = int(np.median(iterations))
iteration_q05 = int(np.ceil(np.percentile(iterations,5)))
iteration_q95 = int(np.ceil(np.percentile(iterations,95)))
print('\nMedian number of iterations (95% conf. int.): {0} ({1}-{2})\n'.format(median_iteration,iteration_q05,iteration_q95))
      
    
        
#print centroid_points




Overwriting kmeans_runner.py


## A: Uniform, 4 clusters

In [1]:
!python kmeans_runner.py 4 uniform 100

Bootstrap replications: 100 / 100
Cluster type: uniform

Number of clusters: 4


Proportions of classes in clusters are based on the last bootstrap replicate.

Cluster | Human | Cyborg | Robot | Spammer |  N  
-------------------------------------------------
      0 |  0.01 |   0.00 |  0.22 |    0.67 |   88
      1 |  0.00 |   0.59 |  0.06 |    0.00 |   57
      2 |  0.98 |   0.02 |  0.04 |    0.25 |  769
      3 |  0.01 |   0.38 |  0.69 |    0.08 |   86
-------------------------------------------------
      N |   752 |     91 |    54 |     103 | 1000

Mean purity (95% conf. int.): 0.86495 (0.83-0.91)

Median number of iterations (95% conf. int.): 4 (3-7)



## B: perturbation, 2 clusters

In [98]:
!python kmeans_runner.py 2 perturbation 100

Bootstrap replications: 100 / 100
Cluster type: perturbation

Number of clusters: 2


Proportions of classes in clusters are based on the last bootstrap replicate.

Cluster | Human | Cyborg | Robot | Spammer |  N  
-------------------------------------------------
      0 |  1.00 |   0.03 |  0.09 |    0.89 |  850
      1 |  0.00 |   0.97 |  0.91 |    0.11 |  150
-------------------------------------------------
      N |   752 |     91 |    54 |     103 | 1000

Mean purity (95% conf. int.): 0.83414 (0.82-0.84)

Median number of iterations (95% conf. int.): 3 (2-4)



## C: perturbation, 4 clusters

In [100]:
!python kmeans_runner.py 4 perturbation 100

Bootstrap replications: 100 / 100
Cluster type: perturbation

Number of clusters: 4


Proportions of classes in clusters are based on the last bootstrap replicate.

Cluster | Human | Cyborg | Robot | Spammer |  N  
-------------------------------------------------
      0 |  0.00 |   0.63 |  0.06 |    0.00 |   60
      1 |  0.03 |   0.36 |  0.67 |    0.22 |  111
      2 |  0.97 |   0.01 |  0.15 |    0.78 |  822
      3 |  0.00 |   0.00 |  0.13 |    0.00 |    7
-------------------------------------------------
      N |   752 |     91 |    54 |     103 | 1000

Mean purity (95% conf. int.): 0.86188 (0.83-0.91)

Median number of iterations (95% conf. int.): 4 (3-7)



## D: trained 

As randomness is not involved in the centroid generation, bootstrapping is not necessary to estimate the purity scores and iteration numbers.

In [96]:
!python kmeans_runner.py 4 trained 1

Bootstrap replications: 1 / 1
Cluster type: trained

Number of clusters: 4


Proportions of classes in clusters are based on the last bootstrap replicate.

Cluster | Human | Cyborg | Robot | Spammer |  N  
-------------------------------------------------
      0 |  0.98 |   0.02 |  0.06 |    0.19 |  763
      1 |  0.00 |   0.63 |  0.06 |    0.00 |   60
      2 |  0.00 |   0.35 |  0.89 |    0.07 |   90
      3 |  0.01 |   0.00 |  0.00 |    0.74 |   87
-------------------------------------------------
      N |   752 |     91 |    54 |     103 | 1000

Mean purity (95% conf. int.): 0.919 (0.92-0.92)

Median number of iterations (95% conf. int.): 2 (2-2)



## Discussion

Choosing centroids from a uniform distribution implies in a Bayesian sense that we do not have any prior information on the final centroids. Therefore, it is surprizing that that the perturbed global aggregate centroids return the same purity and same number of iterations as the uniformly chosen one. However, we have not tested the accuracy, recall or other measures of classification, but we can surmise that the perturbed global aggregates might outperform the uniform one according to these measures. 

It is also possible to observe that identifying more clusters require more iterations in general. However, if the cluster centroids are already trained as in the trained centroid example, the convergence is fast, and the stopping criterion is reached quickly in 2 iteratiions with 4 clusters. The purity of the trained clustering algorithm is the highest as well, 10% higher than the one of the 2 cluster perturbed and about 5% higher than the ones of the 4 cluster perturbed and uniformly chosen centroids.