## Exercise 1
### - Given a text file of 10,000 lines, where each line contains a (key,value) pair
### - Calculate the average value for each key
### - Apply the groupByKey() and reduceByKey() functions
### - Compare the results of the two approaches

### Initialize Spark

In [1]:
import findspark
findspark.init()

### Generate text file with 10,000 lines of (key,value)

In [11]:
# Generate a text file of 10,000 lines, each containing one (key, value) pair,
# where key is a random integer between 1 and 1000 and value is a random integer between 1 and 10000.
# The key and value on each line are separated by a comma.
import os
import random

num_lines = 10000

file_name = "data/keyValue.txt"
os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure the data/ directory exists
with open(file_name, "w") as f:
    for i in range(num_lines):
        key = random.randint(1, 1000)
        value = random.randint(1, 10000)
        f.write(f"{key},{value}\n")

print(f"Generated {num_lines} lines of key-value pairs in {file_name}")

Generated 10000 lines of key-value pairs in data/keyValue.txt


**Create a Spark Session**

In [12]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KeyValuePairs") \
    .getOrCreate()


**Load the dataset**

In [13]:
keyValue_rdd = spark.sparkContext.textFile("data/keyValue.txt")

In [14]:
keyValue_rdd.take(5)

['39,570', '619,4718', '255,5145', '618,4235', '3,5922']

**Parse the pairs**

In [17]:
pairs = keyValue_rdd.map(lambda line: tuple(map(int, line.split(','))))


In [18]:
pairs.take(10)

[(39, 570),
 (619, 4718),
 (255, 5145),
 (618, 4235),
 (3, 5922),
 (850, 2862),
 (630, 5102),
 (198, 5764),
 (220, 1117),
 (535, 2086)]
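
The parsing step can also be written as a named function, which is easier to test in isolation than an inline lambda. The following is a minimal sketch in plain Python; `parse_pair` and its `sep` parameter are hypothetical names introduced here, and the sample lines are copied from the `take(5)` output above.

```python
def parse_pair(line, sep=","):
    """Parse one 'key<sep>value' line into a tuple of two ints."""
    # strip() removes a trailing newline or stray spaces before splitting
    key_text, value_text = line.strip().split(sep)
    return int(key_text), int(value_text)

# Sample lines taken from the notebook output above
sample_lines = ['39,570', '619,4718', '255,5145']
parsed = [parse_pair(line) for line in sample_lines]
print(parsed)
```

The same function could be passed to the RDD directly, for example `keyValue_rdd.map(parse_pair)`.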

#### Calculate the average value for each key

In [19]:
# Group by key
grouped = pairs.groupByKey()

In [21]:
averages_groupByKey = grouped.mapValues(lambda values: sum(values) / len(values))

In [23]:
# Collect and print the results
averages_groupByKey_result = averages_groupByKey.collect()
print("Averages using groupByKey():", averages_groupByKey_result)

Averages using groupByKey(): [(618, 5802.9), (850, 2793.0), (630, 4732.25), (198, 6160.4), (220, 3873.0714285714284), (650, 5607.3), (142, 4902.777777777777), (730, 3856.5), (426, 5459.0), (272, 5827.181818181818), (616, 3673.4166666666665), (744, 5723.846153846154), (704, 5131.125), (840, 3159.5), (38, 4802.727272727273), (26, 5847.0625), (258, 4062.5), (658, 5612.714285714285), (964, 4736.7692307692305), (574, 3940.4285714285716), (86, 5831.833333333333), (340, 6871.6), (936, 3077.9), (876, 4626.714285714285), (494, 4754.9), (754, 5988.1), (674, 2733.25), (456, 4933.714285714285), (706, 4889.1875), (668, 6297.692307692308), (462, 6487.5), (642, 3883.7272727272725), (102, 5579.0), (608, 4133.0), (862, 5915.8), (300, 4220.692307692308), (686, 4990.692307692308), (122, 5169.666666666667), (108, 4725.857142857143), (306, 5164.25), (772, 4754.583333333333), (666, 4320.7), (30, 4641.571428571428), (210, 5472.533333333334), (334, 5684.3), (292, 5774.0), (314, 4699.153846153846), (416, 5307.

In [25]:
# Reduce by key
sum_counts = pairs.mapValues(lambda x: (x, 1))\
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

averages_reduceByKey = sum_counts.mapValues(lambda x: x[0]/x[1])

averages_reduceByKey_result = averages_reduceByKey.collect()
print("Averages using reduceByKey():", averages_reduceByKey_result)

Averages using reduceByKey(): [(618, 5802.9), (850, 2793.0), (630, 4732.25), (198, 6160.4), (220, 3873.0714285714284), (650, 5607.3), (142, 4902.777777777777), (730, 3856.5), (426, 5459.0), (272, 5827.181818181818), (616, 3673.4166666666665), (744, 5723.846153846154), (704, 5131.125), (840, 3159.5), (38, 4802.727272727273), (26, 5847.0625), (258, 4062.5), (658, 5612.714285714285), (964, 4736.7692307692305), (574, 3940.4285714285716), (86, 5831.833333333333), (340, 6871.6), (936, 3077.9), (876, 4626.714285714285), (494, 4754.9), (754, 5988.1), (674, 2733.25), (456, 4933.714285714285), (706, 4889.1875), (668, 6297.692307692308), (462, 6487.5), (642, 3883.7272727272725), (102, 5579.0), (608, 4133.0), (862, 5915.8), (300, 4220.692307692308), (686, 4990.692307692308), (122, 5169.666666666667), (108, 4725.857142857143), (306, 5164.25), (772, 4754.583333333333), (666, 4320.7), (30, 4641.571428571428), (210, 5472.533333333334), (334, 5684.3), (292, 5774.0), (314, 4699.153846153846), (416, 5307
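
The `reduceByKey()` cell carries a `(sum, count)` pair per key instead of averaging directly, because the function passed to `reduceByKey()` must be associative and commutative, and averaging partial averages is not. The plain-Python sketch below illustrates this with made-up values; `mean` and `merge` are hypothetical helper names, not part of the notebook.

```python
from functools import reduce

values = [10, 20, 60]  # made-up values for a single key

def mean(xs):
    return sum(xs) / len(xs)

true_mean = mean(values)

# Naive approach: average the averages of two uneven chunks.
# The chunk [10, 20] and the chunk [60] get equal weight, which is wrong.
naive = mean([mean(values[:2]), mean(values[2:])])

# (sum, count) approach: merging is plain addition, so chunking does not matter.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(merge, [(v, 1) for v in values])
combined = total / count

print(true_mean, naive, combined)
```

The naive result differs from the true mean, while the `(sum, count)` result matches it regardless of how the values are split.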

In [27]:
# Compare the results
groupByKey_dict = dict(averages_groupByKey_result)
reduceByKey_dict = dict(averages_reduceByKey_result)

comparison = [
    (key, groupByKey_dict[key], reduceByKey_dict[key])
    for key in groupByKey_dict
    if groupByKey_dict[key] != reduceByKey_dict[key]
]

if not comparison:
    print("The results are the same.")
else:
    print("Differences found:", comparison)

The results are the same.
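
Both approaches return the same averages, so the practical difference is in how much data moves between partitions. In Spark, `reduceByKey()` combines values within each partition before the shuffle, whereas `groupByKey()` sends every record across it. The sketch below is a plain-Python simulation of that idea, not a measurement of Spark itself; the seed, the partition count of 4, and the names `sample_pairs` and `combine_locally` are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

random.seed(0)  # assumed seed, only to make the simulation repeatable
sample_pairs = [(random.randint(1, 1000), random.randint(1, 10000))
                for _ in range(10000)]

num_partitions = 4  # assumed partition count
partitions = [sample_pairs[i::num_partitions] for i in range(num_partitions)]

# groupByKey-style: every (key, value) record is sent through the shuffle
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey-style: combine to one (sum, count) per key inside each partition
def combine_locally(partition):
    acc = {}
    for key, value in partition:
        s, c = acc.get(key, (0, 0))
        acc[key] = (s + value, c + 1)
    return acc

local = [combine_locally(p) for p in partitions]
shuffled_reduce = sum(len(acc) for acc in local)

# Merge the per-partition results, as the reduce side would
merged = {}
for acc in local:
    for key, (s, c) in acc.items():
        ms, mc = merged.get(key, (0, 0))
        merged[key] = (ms + s, mc + c)
averages = {key: s / c for key, (s, c) in merged.items()}

# Reference result: group all values per key, then average
grouped = defaultdict(list)
for key, value in sample_pairs:
    grouped[key].append(value)
reference = {key: sum(vs) / len(vs) for key, vs in grouped.items()}

print("records shuffled, groupByKey-style:", shuffled_group)
print("records shuffled, reduceByKey-style:", shuffled_reduce)
print("same averages:", averages == reference)
```

With 1,000 possible keys and 4 partitions, the combined approach shuffles at most 4,000 records in this simulation, compared with all 10,000 for the grouping approach. The gap widens as the number of values per key grows.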
