# Spark Exercise

Apache Spark is a distributed computing engine for processing large datasets across a cluster, which makes it a common choice for data engineering projects. It keeps intermediate data in memory where possible, which speeds up multi-step workloads. It reads from many data sources and formats, so the same tool can handle both ingestion and transformation. Its API is available in several languages, including Python, Java, and Scala. Its ecosystem includes libraries for SQL, machine learning, and graph processing, so complex data pipelines and analytics can be built within one framework.

Use Python, `pyspark`, and `pandas` to explore the Apache Spark RDD and DataFrame APIs:

# Spark RDD

Spark RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark: an immutable collection of records partitioned across the nodes of a cluster, which Spark can rebuild after a failure. This enables fault-tolerant, distributed processing of large datasets. RDDs are Spark's low-level API for distributed data processing, built around map operations (transformations) and reduce operations (aggregations).
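
To make the map and reduce steps concrete, here is a plain-Python model of the idea, with no Spark involved. The data is split into partitions, the map function runs on each partition independently, and partial results are merged at the end. The helper name and the partition count are illustrative assumptions, not Spark internals.

```python
from functools import reduce


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into lists that stand in for RDD partitions."""
    return [items[i::num_partitions] for i in range(num_partitions)]


records = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = split_into_partitions(records, num_partitions=3)

# "map": each partition is transformed on its own, so the work could run on separate nodes.
squared = [[x * x for x in part] for part in partitions]

# "reduce": aggregate inside each partition first, then merge the partial results.
partial_sums = [sum(part) for part in squared]
total = reduce(lambda a, b: a + b, partial_sums)

print(partitions)
print(total)
```

The merge step works in any order because addition is associative and commutative, which is also what Spark expects of the function passed to `reduce` or `reduceByKey`.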

## Import Necessary Libraries

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pandas as pd

## Spark Context and Session
Initialize the Spark session and get its Spark context

In [4]:
spark = (
    SparkSession
    .builder
    .appName("LoadClosingOddsRDD")
    .config("spark.master", "local[*]")
    .getOrCreate()
)

sc = spark.sparkContext

## Load Data into RDD

In [11]:
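csv_path = "closing_odds.csv"

# Compression is set explicitly because the .csv name gives pandas nothing to infer it from.
df = pd.read_csv(
    csv_path,
    compression="gzip",
    encoding="utf-8",
)

# Convert each row to a plain tuple (no index, no namedtuple), then distribute the rows as an RDD.
data_tuples = list(df.itertuples(index=False, name=None))

rdd = sc.parallelize(data_tuples)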

## Map Operation

Split each record into a key (the first field) and a value (the remaining fields) to create key-value pairs

In [12]:
rdd_kv = rdd.map(lambda tup: (tup[0], tup[1:]))
print("key-value pairs:")
for kv in rdd_kv.take(5):
    print(kv)

key-value pairs:
(170088, ('England: Premier League', '2005-01-01', 'Liverpool', 0, 'Chelsea', 1, 2.9944, 3.1944, 2.2256, 3.2, 3.25, 2.29, 'Paddy Power', 'Sportingbet', 'Expekt', 9, 9, 9))
(170089, ('England: Premier League', '2005-01-01', 'Fulham', 3, 'Crystal Palace', 1, 1.9456, 3.2333, 3.6722, 2.04, 3.3, 4.15, 'Pinnacle Sports', 'bet-at-home', 'Expekt', 9, 9, 9))
(170090, ('England: Premier League', '2005-01-01', 'Aston Villa', 1, 'Blackburn', 0, 1.8522, 3.2611, 4.0144, 2.0, 3.4, 4.5, 'Pinnacle Sports', 'Paddy Power', 'Sportingbet', 9, 9, 9))
(170091, ('England: Premier League', '2005-01-01', 'Bolton', 1, 'West Brom', 1, 1.6122, 3.4133, 5.4722, 1.67, 3.57, 6.27, 'Coral', 'Pinnacle Sports', 'Pinnacle Sports', 9, 9, 9))
(170092, ('England: Premier League', '2005-01-01', 'Charlton', 1, 'Arsenal', 3, 5.9878, 3.4778, 1.5567, 7.0, 3.6, 1.62, 'Expekt', 'Paddy Power', 'bet365', 9, 9, 9))


## Reduce Operation

Count the matches per league by reducing the key-value pairs

In [18]:
matches_per_league_rdd = (
    rdd_kv
    .map(lambda kv: (kv[1][0], 1))       # key on the league, the first field of the value tuple
    .reduceByKey(lambda a, b: a + b)     # sum the 1s per league
)

for league, count in matches_per_league_rdd.take(10):
    print(f"{league}: {count} games")

## Collect Results

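Transformations such as `map` and `reduceByKey` are lazy: Spark runs them only when an action such as `take` or `collect` asks for a result. `collect` returns the complete result to the driver. Show what you calculated.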

In [None]:
matches_per_league_rdd.collect()

## Save Results

In [None]:
matches_per_league_rdd \
    .map(lambda kv: f"{kv[0]},{kv[1]}") \
    .saveAsTextFile("/mnt/data/output/matches_per_league_txt")

# Spark DataFrame

Spark DataFrame is a distributed collection of data organized into named columns, designed for efficient data manipulation and analysis in Apache Spark. It provides a high-level abstraction for structured data and is used for tasks such as data ingestion, transformation, querying, and analysis.

## Load Data into DataFrame

In [None]:
# TODO

## View DataFrame Schema

In [None]:
# TODO

## View DataFrame Data

In [None]:
# TODO

## Filter Data

Perform a filter operation on a column

In [None]:
# TODO

## Group By and Aggregate

Perform a group-by and aggregate operation

In [None]:
# TODO

## Save DataFrame to Parquet

In [None]:
# TODO