##Using the MRJob Class below  calculate the  KL divergence of the following two objects.

## Pairwise similarity using K-L divergence

In probability theory and information theory, the Kullback–Leibler divergence 
(also information divergence, information gain, relative entropy, KLIC, or KL divergence) 
is a non-symmetric measure of the difference between two probability distributions P and Q. 
Specifically, the Kullback–Leibler divergence of Q from P, denoted DKL(P\‖Q), 
is a measure of the information lost when Q is used to approximate P:

For discrete probability distributions P and Q, 
the Kullback–Leibler divergence of Q from P is defined to be

    + KLDistance(P, Q) = Sum_over_item_i (P(i) log (P(i) / Q(i))      

In the extreme cases, the KL Divergence is 1 when P and Q are maximally different
and is 0 when the two distributions are exactly the same (follow the same distribution).

For more information on K-L Divergence see:

    + [K-L Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)

For the next three question we will use an MRjob class for calculating pairwise similarity 
using K-L Divergence as the similarity measure:

Job 1: create inverted index (assume just two objects)
Job 2: calculate/accumulate the similarity of each pair of objects using K-L Divergence

Download the following notebook and then fill in the code for the first reducer to calculate 
the K-L divergence of objects (letter documents) in line1 and line2, i.e., KLD(Line1||line2).

Here we ignore characters which are not alphabetical. And all alphabetical characters are lower-cased in the first mapper.

http://nbviewer.ipython.org/urls/dl.dropbox.com/s/9onx4c2dujtkgd7/Kullback%E2%80%93Leibler%20divergence-MIDS-Midterm.ipynb
https://www.dropbox.com/s/zr9xfhwakrxz9hc/Kullback%E2%80%93Leibler%20divergence-MIDS-Midterm.ipynb?dl=0

In [3]:
%%writefile kltext.txt
1.Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from large volumes of data in various forms (data in various forms, data in various forms, data in various forms), either structured or unstructured,[1][2] which is a continuation of some of the data analysis fields such as statistics, data mining and predictive analytics, as well as Knowledge Discovery in Databases.
2.Machine learning is a subfield of computer science[1] that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions.

Overwriting kltext.txt


##MRjob class for calculating pairwise similarity using K-L Divergence as the similarity measure

Job 1: create inverted index (assume just two objects) <P>
Job 2: calculate the similarity of each pair of objects 

In [4]:
import numpy as np
np.log(3)

1.0986122886681098

In [59]:
%%writefile kldivergence.py
from mrjob.job import MRJob
import re
import numpy as np
class kldivergence(MRJob):
    def mapper1(self, _, line):
        index = int(line.split('.',1)[0])
        letter_list = re.sub(r"[^A-Za-z]+", '', line).lower()
        count = {}
        for l in letter_list:
            if count.has_key(l):
                count[l] += 1
            else:
                count[l] = 1
        for key in count:
            yield key, [index, count[key]*1.0/len(letter_list)]


    def reducer1(self, key, values):
        yield key, (values)
    
#     def reducer2(self, key, values):
#         kl_sum = 0
#         for value in values:
#             kl_sum = kl_sum + value
#         yield None, kl_sum
            
    def steps(self):
        return [self.mr(mapper=self.mapper1,
                         reducer=self.reducer1)
#                 self.mr(reducer=self.reducer2)
               ]

if __name__ == '__main__':
    kldivergence.run()

Overwriting kldivergence.py


In [60]:
%reload_ext autoreload
%autoreload 2
from kldivergence import kldivergence
mr_job = kldivergence(args=['kltext.txt'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)

TypeError: 'generator' object has no attribute '__getitem__'