## spark examples

### tf-idf

In [93]:
import sys
from operator import add
from pyspark import SparkContext
from math import log

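def get_word_length(text):
    # Total number of whitespace-separated tokens in the file.
    # `sc` is assumed to be the SparkContext already running in this session.
    lines = sc.textFile(text)
    return lines.map(lambda s: len(s.split())).reduce(add)

def tf(text):
    # Term frequency: each occurrence contributes 1/n, so summing per word
    # gives count / n.
    n = get_word_length(text)
    lines = sc.textFile(text, 1)
    # split() with no argument matches the tokenization in get_word_length
    # and avoids empty-string tokens from repeated spaces.
    counts = lines.flatMap(lambda x: x.split()) \
        .map(lambda x: (x, 1. / n)).reduceByKey(add)
    output = counts.collect()
    return dict(output)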

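def inverted_tf(word, texts):
    # Document frequency: the number of texts that contain `word`.
    binary_docs = []
    for t in texts:
        lines = sc.textFile(t)  # read each text in turn
        # 1 if any line contains the word (substring match), else 0
        has_word = 1 if lines.filter(lambda x: word in x).count() > 0 else 0
        binary_docs.append(has_word)
    return sum(binary_docs)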

def idf(word, texts):
    return log(float(len(texts)) / (1 + inverted_tf(word, texts)))

def tfidf(word, text, texts):
    return tf(text).get(word, 0) * idf(word, texts)
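
To see what the functions above compute without a Spark cluster, here is a minimal pure-Python sketch of the same formulas on in-memory strings. The names `docs`, `tf_local`, `df_local`, `idf_local`, and `tfidf_local` are hypothetical stand-ins, and the tiny documents are made up for illustration. Unlike `inverted_tf`, which tests `word in x` against whole lines, this sketch matches whole tokens.

```python
from collections import Counter
from math import log

# Made-up documents standing in for the files in `texts` (an assumption).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a bird sang",
    "fish swim",
]

def tf_local(doc):
    # Term frequency: occurrences of each token divided by the token count.
    tokens = doc.split()
    n = len(tokens)
    return {w: c / float(n) for w, c in Counter(tokens).items()}

def df_local(word, docs):
    # Document frequency: number of documents containing the token.
    return sum(1 for d in docs if word in d.split())

def idf_local(word, docs):
    # Same smoothed form as idf(): log(N / (1 + df)).
    return log(float(len(docs)) / (1 + df_local(word, docs)))

def tfidf_local(word, doc, docs):
    return tf_local(doc).get(word, 0) * idf_local(word, docs)

for w in ["the", "cat", "sat"]:
    print("tf-idf of {0} is {1}".format(w, tfidf_local(w, docs[0], docs)))
```

Here `cat` appears in one of the four documents, so its IDF is log(4/2), while `the` appears in two, giving log(4/3).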

### testing it out

In [107]:
text = 'd_0'
texts = ["data_files/d_{}".format(str(i)) for i in range(10)]
sample_words = ['and', 'set', 'says', 'give', 'it', 'cowhide',
                'a-growling', 'tore', 'minute,', 'aint', 'there',
                'better', 'then', 'you', 'dandy', 'though', 'a', 'he',
                'sweet-scented', 'a-mumbling', 'up', 'something']

In [108]:
for s in sample_words:
    print("tf-idf of {0} is {1}".format(s, tfidf(s, text, texts)))


tf-idf of and is -0.00465113677445
tf-idf of set is -0.000142965269706
tf-idf of says is -8.57791618239e-05
tf-idf of give is -0.000114372215765
tf-idf of it is -0.00138199760716
tf-idf of cowhide is -9.53101798043e-06
tf-idf of a-growling is -9.53101798043e-06
tf-idf of tore is -9.53101798043e-06
tf-idf of minute, is -3.81240719217e-05
tf-idf of aint is 0.0
tf-idf of there is -0.000295461557393
tf-idf of better is -2.85930539413e-05
tf-idf of then is -0.000400302755178
tf-idf of you is -0.000810136528337
tf-idf of dandy is -0.0
tf-idf of though is -5.71861078826e-05
tf-idf of a is -0.00246853365693
tf-idf of he is -0.00150590084091
tf-idf of sweet-scented is -9.53101798043e-06
tf-idf of a-mumbling is -9.53101798043e-06
tf-idf of up is -0.000524205988924
tf-idf of something is -0.000123903233746


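OK, so texts that all come from the same corpus don't give the best demonstration of TF-IDF: common words tend to show up in every document, so few of them stand out. The negative scores come from the IDF term. Whenever `1 + inverted_tf(word, texts)` exceeds `len(texts)`, the ratio inside the log drops below 1 and the score goes negative. Still, it's nice to do some neat map-reducing in any case.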
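
The sign of the scores above follows from the `log(N / (1 + df))` form used in `idf`. A small sketch, with a hypothetical helper `smoothed_idf` and `df` values chosen only for illustration, shows where the term crosses zero:

```python
from math import log

def smoothed_idf(n_docs, df):
    # Same form as idf() above: log(N / (1 + df)).
    return log(float(n_docs) / (1 + df))

n_docs = 10  # matches len(texts) in the test cell
for df in [0, 4, 9, 10, 40]:
    print("df={0:>2}  idf={1:+.4f}".format(df, smoothed_idf(n_docs, df)))
```

With 10 documents the term is positive for `df` up to 8, exactly zero at `df = 9`, and negative once `df` reaches 10. A word found in every text, or a count that runs past the number of texts, therefore pulls the score below zero.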

I'll come back to this once I find some cool texts to demonstrate on.

### Calculate Pi

This one is straight from the docs, but I'll be making some modifications soon.

In [109]:
import time
import sys
from random import random
from operator import add

from pyspark import SparkContext

def pi():
    
    partitions = 3
    n = 1000000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()

pi()

Pi is roughly 3.142999
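
The estimate works because points drawn uniformly from the square [-1, 1] x [-1, 1] land inside the unit circle with probability pi/4, so four times the hit fraction approximates pi. As a minimal sketch without Spark, the same map-then-sum can be written in plain Python. The function `estimate_pi`, the sample size, and the seed are hypothetical choices for illustration.

```python
from random import Random

def estimate_pi(n, seed=0):
    # A seeded generator keeps the run repeatable (the seed is arbitrary).
    rng = Random(seed)

    def inside(_):
        # Sample a point in the square and test whether it falls in the circle.
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    # map + sum plays the role of .map(f).reduce(add) in the Spark version.
    count = sum(map(inside, range(n)))
    return 4.0 * count / n

print("Pi is roughly %f" % estimate_pi(200000))
```

The standard error shrinks like 1/sqrt(n), so each extra digit of accuracy costs roughly 100 times more samples.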
