## Exercise 1
### - Given a text file of 10,000 lines, where each line contains a (key, value) pair
### - Calculate the average value for each key
### - Apply the groupByKey() and reduceByKey() functions
### - Compare the two approaches

### Initialize Spark

In [37]:
import findspark
findspark.init()

### Generate text file with 10,000 lines of (key,value)

In [38]:
# generate a text file of 10,000 lines, each line contains a (key, value) pair
# where key is a random integer between 1 and 1000, and value is a random integer between 1 and 10000
# the key and value on each line are separated by a comma
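import os
import random

num_lines = 10000

file_name = "data/keyValue.txt"
# make sure the output directory exists before writing
os.makedirs(os.path.dirname(file_name), exist_ok=True)
with open(file_name, "w") as f: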
    for i in range(num_lines):
        key = random.randint(1, 1000)
        value = random.randint(1, 10000)
        f.write(f"{key},{value}\n")

print(f"Generated {num_lines} lines of key-value pairs in {file_name}")

Generated 10000 lines of key-value pairs in data/keyValue.txt


**Create a Spark Session**

In [39]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KeyValuePairs") \
    .getOrCreate()


**Load the dataset**

In [40]:
keyValue_rdd = spark.sparkContext.textFile("data/keyValue.txt")

In [41]:
keyValue_rdd.take(5)

['18,3159', '129,6019', '220,1520', '410,1511', '370,3914']

**Parse the pairs**

In [42]:
# split each "key,value" line on the comma and convert both fields to int
pairs = keyValue_rdd.map(lambda line: tuple(map(int, line.split(','))))


In [43]:
pairs.take(10)

[(18, 3159),
 (129, 6019),
 (220, 1520),
 (410, 1511),
 (370, 3914),
 (879, 2951),
 (899, 6874),
 (163, 5113),
 (171, 5550),
 (779, 6385)]

#### Calculate average value for each key

In [44]:
# Group by key
grouped = pairs.groupByKey()

averages_groupByKey = grouped.mapValues(lambda values: sum(values) / len(values))

# Collect and print the results
averages_groupByKey_result = averages_groupByKey.collect()
print("Averages using groupByKey():", averages_groupByKey_result)

Averages using groupByKey(): [(18, 4878.214285714285), (220, 4342.833333333333), (410, 4575.0), (370, 4966.0), (268, 7868.857142857143), (776, 4523.642857142857), (102, 4956.833333333333), (574, 4548.636363636364), (668, 5812.0), (740, 5214.333333333333), (616, 3479.3636363636365), (778, 6641.0), (920, 5305.666666666667), (422, 5734.777777777777), (674, 4716.0), (470, 4388.714285714285), (166, 4059.8333333333335), (484, 4850.7), (112, 6428.888888888889), (904, 4973.666666666667), (34, 7371.166666666667), (332, 4376.866666666667), (320, 5016.3), (734, 4304.166666666667), (156, 5929.0), (882, 4843.714285714285), (896, 5023.133333333333), (468, 3701.4285714285716), (780, 5126.538461538462), (692, 4573.0), (218, 5468.3125), (252, 4976.5), (572, 5284.5), (38, 4367.846153846154), (750, 4777.928571428572), (730, 5898.428571428572), (412, 4869.2), (506, 4506.5), (988, 3946.5833333333335), (230, 5190.555555555556), (500, 6073.090909090909), (908, 4091.222222222222), (636, 4569.461538461538), (9

In [45]:
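# Reduce by key: turn each value into a (sum, count) pair,
# then add the pairs element-wise for each key
sum_counts = pairs.mapValues(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Divide the total by the count to get the average for each key
averages_reduceByKey = sum_counts.mapValues(lambda x: x[0] / x[1])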

averages_reduceByKey_result = averages_reduceByKey.collect()
print("Averages using reduceByKey():", averages_reduceByKey_result)

Averages using reduceByKey(): [(18, 4878.214285714285), (220, 4342.833333333333), (410, 4575.0), (370, 4966.0), (268, 7868.857142857143), (776, 4523.642857142857), (102, 4956.833333333333), (574, 4548.636363636364), (668, 5812.0), (740, 5214.333333333333), (616, 3479.3636363636365), (778, 6641.0), (920, 5305.666666666667), (422, 5734.777777777777), (674, 4716.0), (470, 4388.714285714285), (166, 4059.8333333333335), (484, 4850.7), (112, 6428.888888888889), (904, 4973.666666666667), (34, 7371.166666666667), (332, 4376.866666666667), (320, 5016.3), (734, 4304.166666666667), (156, 5929.0), (882, 4843.714285714285), (896, 5023.133333333333), (468, 3701.4285714285716), (780, 5126.538461538462), (692, 4573.0), (218, 5468.3125), (252, 4976.5), (572, 5284.5), (38, 4367.846153846154), (750, 4777.928571428572), (730, 5898.428571428572), (412, 4869.2), (506, 4506.5), (988, 3946.5833333333335), (230, 5190.555555555556), (500, 6073.090909090909), (908, 4091.222222222222), (636, 4569.461538461538), (
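
The reduceByKey() cell carries a (sum, count) pair instead of averaging directly because the reducing function is applied pairwise, and pairwise averaging depends on how the values are grouped. The following plain-Python sketch (no Spark required) illustrates this with `functools.reduce`; the names `to_pair`, `merge`, and the sample values are hypothetical and chosen only for this illustration.

```python
from functools import reduce

def to_pair(value):
    # Same idea as mapValues(lambda x: (x, 1)): start each value as (sum, count)
    return (value, 1)

def merge(a, b):
    # Same idea as the reduceByKey lambda: add sums and counts element-wise
    return (a[0] + b[0], a[1] + b[1])

values = [10, 20, 60]  # hypothetical values that share one key

# Merging (sum, count) pairs gives the same total regardless of grouping
total, count = reduce(merge, map(to_pair, values))
average = total / count

# Averaging pairwise does not: ((10 + 20) / 2 + 60) / 2 differs from the mean
naive_average = reduce(lambda a, b: (a + b) / 2, values)

print("sum/count average:", average)
print("naive pairwise average:", naive_average)
```

Here `average` is the true mean of the three values, while `naive_average` weights the last value more heavily, which is why the notebook divides only once, after all pairs have been merged.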

In [46]:
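# Compare the results key by key, using a float tolerance and
# checking the union of keys so a key missing from either side is reported
import math

groupByKey_dict = dict(averages_groupByKey_result)
reduceByKey_dict = dict(averages_reduceByKey_result)

all_keys = set(groupByKey_dict) | set(reduceByKey_dict)
comparison = [
    (key, groupByKey_dict.get(key), reduceByKey_dict.get(key))
    for key in sorted(all_keys)
    if key not in groupByKey_dict
    or key not in reduceByKey_dict
    or not math.isclose(groupByKey_dict[key], reduceByKey_dict[key])
]

if not comparison:
    print("The results are the same.")
else:
    print("Differences found:", comparison)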

The results are the same.
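
Both approaches return the same averages, so the practical difference lies in how much data moves between partitions. groupByKey() sends every (key, value) record through the shuffle, whereas reduceByKey() first merges values per key inside each partition. The sketch below is a simplified plain-Python model of that behavior, not Spark's actual implementation; the function names and the two sample partitions are hypothetical.

```python
def records_shuffled_group_by_key(partitions):
    # Model of groupByKey(): every record is shuffled unchanged
    return sum(len(partition) for partition in partitions)

def records_shuffled_reduce_by_key(partitions):
    # Model of reduceByKey(): each partition pre-merges its records per key,
    # so it emits at most one record per distinct key
    return sum(len({key for key, _ in partition}) for partition in partitions)

# Hypothetical data: two partitions holding (key, value) pairs
partitions = [
    [(1, 10), (1, 20), (2, 5)],
    [(1, 30), (2, 15), (2, 25)],
]

print("groupByKey records shuffled:", records_shuffled_group_by_key(partitions))
print("reduceByKey records shuffled:", records_shuffled_reduce_by_key(partitions))
```

In this model the gap grows with the number of repeated keys per partition. With only 10,000 lines spread over 1,000 keys the difference is small, but on larger datasets reduceByKey() is generally the preferred choice for aggregations such as this one.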
