## Spark Lab 2

#### Problem:

Ten of Swami's discourses on different subjects were downloaded from Radiosai as text files.
They were preprocessed to remove punctuation, and an identifier was added to each line to indicate the discourse number.
Each line has the format `<id>:<line>`.
All the files are zipped and available as data.zip.
Unzip data.zip into your working folder.

Treat each discourse as a set of words (case insensitive) and use Jaccard similarity to measure how similar two discourses are. Identify the pair of discourses with the highest similarity and the pair with the lowest, and report the similarity value for each.
Jaccard Similarity (D1,D2) = |Intersection(D1,D2)| / |Union(D1,D2)|, where |S| denotes the cardinality of S.
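
As a quick check of the formula, here is a minimal plain-Python sketch. The `jaccard` helper and the two sample word sets are illustrative only and are not part of the lab data; treating two empty sets as similarity 0.0 is an assumption.

```python
def jaccard(d1, d2):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = d1 | d2
    if not union:
        # Assumption: two empty sets are given similarity 0.0
        return 0.0
    return len(d1 & d2) / len(union)


a = {"pure", "mind", "leads", "goodness"}
b = {"pure", "mind", "mighty", "power"}

# 2 shared words out of 6 distinct words
print(jaccard(a, b))
```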

Complete the following:
a) Load all discourses into an RDD for suitable further processing
b) Find Jaccard similarity of a pair
c) Find max and min pair


In [1]:
## setup
import findspark
findspark.init()  # locate the Spark installation before importing pyspark

from pyspark import SparkConf, SparkContext

# Run Spark locally with a single worker thread
conf = SparkConf().setAppName('Lab2').setMaster('local')
sc = SparkContext(conf=conf)

#### a) Load all discourses into an RDD for suitable further processing 

In [2]:
## Paths to all discourses
paths = ','.join(["data/discourses/discourse"+str(i)+".txt" for i in range(1,11)])
paths

'data/discourses/discourse1.txt,data/discourses/discourse2.txt,data/discourses/discourse3.txt,data/discourses/discourse4.txt,data/discourses/discourse5.txt,data/discourses/discourse6.txt,data/discourses/discourse7.txt,data/discourses/discourse8.txt,data/discourses/discourse9.txt,data/discourses/discourse10.txt'

In [3]:
## loading the data
rdd = sc.textFile(paths)
rdd.take(2)

['1:Manasu nirmalambe manchiki margambu Pure mind leads to goodness Manasu nirmalambe mahitha shakthi Pure mind is the mighty power Nirmalambu manase neeradi mutyavu Pure mind is the pearl of the great ocean of life One needs to dwell on the pearl from where does one get the pearl Is it found on trees Does it crop out of the earth No it is got from ocean alone What is this ocean The ocean is the ocean of life In this very ocean of life one can find the pearls The body is like the shell the pearl of atma resides in the shell called the body. The pearl is white and shines bright That is atma nirgunam niranjanam sanathanam nikethanam nithya sudha budha muktha nirmala swaroopinam the pure one There is no other entity which is as effulgent as pure and as sacred as the atma Though atma does not have legs nothing can move faster than it The atma does not have hands but it can hold everything. The atma does not have eyes there is nothing that it cannot see Love is the only path to understand a

In [5]:
## Mapping each discourse to (id, set of lowercase words)
rdd2 = rdd.map(lambda a: (a.split(':')[0],set(a.split(':')[1].lower().strip().split(" ")) ) )
rdd2.count()

10
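
The lambda above splits on every `:` and on single spaces, so a colon inside the text would truncate the discourse and repeated spaces would add an empty-string "word". A slightly more defensive parser might look like the sketch below; `parse_line` is a hypothetical helper, not something used in this notebook, and the sample line is made up.

```python
def parse_line(line):
    """Split '<id>:<text>' into (id, set of lowercase words)."""
    # Split only on the first colon so colons inside the text are kept
    doc_id, _, text = line.partition(":")
    # split() with no argument collapses runs of whitespace and never
    # yields empty strings, unlike split(" ")
    return doc_id, set(text.lower().split())


doc_id, words = parse_line("1:Pure mind  leads to goodness Pure mind")
print(doc_id, sorted(words))
```

If adopted, something like `rdd.map(parse_line)` would replace the lambda. The resulting word sets, and so the similarity values, may differ slightly from the ones shown in this notebook.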

#### b) Find Jaccard similarity of a pair

-> Using cartesian

In [101]:
## Cartesian function to generate pairs of documents
rdd3 = rdd2.cartesian(rdd2)
rdd3.take(2)

[(('1',
   {'all',
    'alone',
    'and',
    'are',
    'as',
    'atma',
    'body',
    'body.',
    'bright',
    'budha',
    'but',
    'called',
    'can',
    'cannot',
    'compassion',
    'crop',
    'describe',
    'does',
    'dwell',
    'earth',
    'effulgent',
    'entity',
    'everything.',
    'exists',
    'eyes',
    'faster',
    'find',
    'fluid',
    'fluids',
    'found',
    'from',
    'get',
    'god',
    'goodness',
    'got',
    'great',
    'hands',
    'has',
    'have',
    'hold',
    'in',
    'is',
    'it',
    'karuna',
    'leads',
    'legs',
    'life',
    'like',
    'live',
    'love',
    'mahitha',
    'manase',
    'manasu',
    'manchiki',
    'margambu',
    'mighty',
    'mind',
    'misery',
    'move',
    'muktha',
    'mutyavu',
    'nava',
    'needs',
    'neeradi',
    'nikethanam',
    'nine',
    'niranjanam',
    'nirgunam',
    'nirmala',
    'nirmalambe',
    'nirmalambu',
    'nithya',
    'no',
    'not',
    'nothin

In [102]:
## Keeping one ordering of each pair and dropping self-pairs (ids are compared as strings)
rdd4 = rdd3.filter(lambda a: a[0][0]>a[1][0])
rdd4.count()

45
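
The filter keeps only one of the two orderings of each pair and removes self-pairs, which leaves C(10, 2) = 45 combinations. Because the ids are strings, `'2' > '10'` is true, which explains pairs such as `('2', '10')` in the later output. The same idea can be sketched in plain Python (no Spark involved; `ids` is just a stand-in for the discourse ids):

```python
from itertools import combinations

ids = [str(i) for i in range(1, 11)]

# Same idea as cartesian + filter: keep (x, y) only when x > y as strings
filtered = [(x, y) for x in ids for y in ids if x > y]

# itertools.combinations yields each unordered pair exactly once
pairs = list(combinations(ids, 2))

print(len(filtered), len(pairs))
```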

In [103]:
## Calculating the Jaccard similarity for each pair
rdd5 = rdd4.map(lambda a: ((a[0][0], a[1][0]), len(a[0][1].intersection(a[1][1]))/len(a[0][1].union(a[1][1])) ))
rdd5.take(2)

[(('2', '1'), 0.1276595744680851), (('2', '10'), 0.13636363636363635)]

-> Without using cartesian

In [75]:
## Canonical key for a pair: the same string regardless of argument order
def hashkey(a, b):
    s = sorted([int(a), int(b)])
    return ','.join(str(i) for i in s)

In [105]:
## Replicating each record once for every pair it can form
rdd3 = rdd2.flatMap(lambda a: [(hashkey(i, a[0]), a[1]) for i in range(1, 11) if i != int(a[0])])


In [106]:
## Reducing by key so that the two records of each pair meet.
## Each key has exactly two values, so the function runs once per key and returns the similarity.
rdd5 = rdd3.reduceByKey(lambda a, b: len(a.intersection(b)) / len(a.union(b)))

In [97]:
rdd5.count()

45

In [107]:
rdd5.take(5)

[('1,2', 0.1276595744680851),
 ('2,8', 0.13071895424836602),
 ('3,5', 0.13690476190476192),
 ('4,10', 0.13071895424836602),
 ('6,7', 0.11764705882352941)]
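
The replicate-then-reduce approach can be simulated in plain Python to see why it works. This is only a sketch on a hypothetical three-document corpus: `pair_key` mirrors `hashkey`, and a dictionary of lists stands in for Spark's shuffle. Note that the reducer returns a float rather than a set, so it is only valid because every key receives exactly two sets; a third value would make the second call fail.

```python
from collections import defaultdict


def pair_key(a, b):
    """Order-independent key for a pair of ids, e.g. (5, 1) -> '1,5'."""
    lo, hi = sorted((int(a), int(b)))
    return f"{lo},{hi}"


# Hypothetical toy corpus: id -> set of words
docs = {1: {"a", "b", "c"}, 2: {"b", "c", "d"}, 3: {"x", "a"}}

# "flatMap" step: emit each document once per partner id
grouped = defaultdict(list)
for doc_id, words in docs.items():
    for other in docs:
        if other != doc_id:
            grouped[pair_key(doc_id, other)].append(words)

# "reduceByKey" step: each key holds exactly two sets, so a single
# combine call turns them into a similarity score
scores = {k: len(s1 & s2) / len(s1 | s2) for k, (s1, s2) in grouped.items()}
print(scores)
```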

-- *The Cartesian method takes much longer than the non-Cartesian method.*

#### c) Find max and min pair

In [108]:
max_pair = rdd5.reduce(lambda a,b: a if a[1]>b[1] else b)
max_pair

('1,5', 0.4557823129251701)

  `Documents 1 and 5 are the most similar`

In [100]:
min_pair = rdd5.reduce(lambda a,b: a if a[1]<b[1] else b)
min_pair

('8,10', 0.09202453987730061)

  `Documents 8 and 10 are the least similar`
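
The two `reduce` calls amount to taking the maximum and minimum by the similarity value. A plain-Python sketch of the same selection is shown below; the `scores` list is made-up data standing in for the collected `(pair, similarity)` records, not the lab results.

```python
# Hypothetical (pair, similarity) records
scores = [("1,2", 0.30), ("1,3", 0.45), ("2,3", 0.10)]

# Select by the second element of each record, the similarity value
most_similar = max(scores, key=lambda kv: kv[1])
least_similar = min(scores, key=lambda kv: kv[1])

print(most_similar)
print(least_similar)
```

PySpark RDDs also provide `max` and `min` methods that accept a `key` function, so `rdd5.max(key=lambda kv: kv[1])` should be an equivalent alternative to the `reduce` form.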