# Spark SQL course scripts

Note: if you are using the setup that includes Jupyter (started with docker-compose-jupyter.yml, for example on a Mac), use the Spark configuration in the following cell (disabled by default) instead of the one that comes after it.



In [1]:
# Mac configuration: running from a Jupyter notebook

from pyspark import SparkContext, SparkConf

conf = SparkConf() \
    .setAppName('SparkApp') \
    .setMaster('spark://spark:7077') \
    .set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")  # used for S3A object storage (MinIO)

sc = SparkContext(conf=conf)

from pyspark.sql import SQLContext
# Create an SQLContext for SQL operations
sql_context = SQLContext(sc)

minio_ip_address = "minio"


## DataFrames vs Views

In [2]:
from pyspark.sql.functions import avg

# Create DataFrame
data = [("Brooke", 20), ("Brooke", 25), ("Denny", 31), ("Jules", 30), ("TD", 35)]
df = sql_context.createDataFrame(data, ["name", "age"])
# Calculate average age by name using DataFrame API
avgDF = df.groupBy("name").agg(avg("age"))
avgDF.show()
# Create a temporary view to use SQL
df.createOrReplaceTempView("people")
# Calculate average age by name using SQL
sqlDF = sql_context.sql("""
SELECT name, avg(age) as avg_age
FROM people
GROUP BY name
""")
sqlDF.show()

                                                                                

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Jules|    30.0|
|    TD|    35.0|
| Denny|    31.0|
+------+--------+

+------+-------+
|  name|avg_age|
+------+-------+
|Brooke|   22.5|
| Jules|   30.0|
|    TD|   35.0|
| Denny|   31.0|
+------+-------+



## Columns and column expressions

In [3]:
from pyspark.sql.functions import col, expr
print(df.columns)

print(df["name"])

print(df.select(col("age") + 1).first())

print(df.select(df.age + 1).first())

print(df.select(expr("age + 1")).first())

['name', 'age']
Column<'name'>
Row((age + 1)=21)
Row((age + 1)=21)
Row((age + 1)=21)


                                                                                

## Rows

In [4]:
from pyspark.sql import Row

first_row = df.first()
print(first_row)  # Output: Row(name='Brooke', age=20)

tail_rows = df.tail(2)
print(tail_rows)  # Output: [Row(name='Jules', age=30), Row(name='TD', age=35)]

row = Row("Brooke", 20)
print(row)  # Output: <Row('Brooke', 20)>

print(row[0])  # Output: Brooke
print(row[1])  # Output: 20

# Convert sequence to DataFrame
seq_df = sql_context.createDataFrame([("Brooke", 20)], ["name", "age"])
seq_df.show()

Row(name='Brooke', age=20)
[Row(name='Jules', age=30), Row(name='TD', age=35)]
<Row('Brooke', 20)>
Brooke
20
+------+---+
|  name|age|
+------+---+
|Brooke| 20|
+------+---+



## Projection and selection

In [5]:
# Apply transformations
# show() prints the table and returns None, so its result is not assigned
(
    df.select("name")      # Projection: keep only the "name" column
      .where("age <= 30")  # Selection: keep rows where "age" is at most 30
      .distinct()          # Eliminate duplicates
      .show()
)


+------+
|  name|
+------+
|Brooke|
| Jules|
+------+



## Renaming, adding, and dropping columns

In [6]:
from pyspark.sql.functions import floor, col

df.withColumnRenamed("name", "nom").show(1)

# Add a "dizaine" (decade) column computed from "age", then drop "age"
df.withColumn("dizaine", floor(col("age") / 10)) \
    .drop("age") \
    .show(1)

+------+---+
|   nom|age|
+------+---+
|Brooke| 20|
+------+---+
only showing top 1 row




+------+-------+
|  name|dizaine|
+------+-------+
|Brooke|      2|
+------+-------+
only showing top 1 row



                                                                                

## Aggregates

In [7]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in min and max

# Maximum age per name (show() returns None, so its result is not assigned)
df.groupBy("name").agg(F.max("age").alias("max_age")).show()

# Global aggregates over the whole DataFrame
df.select(F.min("age"), F.max("age"), F.avg("age")).show()

                                                                                

+------+-------+
|  name|max_age|
+------+-------+
|Brooke|     25|
| Jules|     30|
|    TD|     35|
| Denny|     31|
+------+-------+

+--------+--------+--------+
|min(age)|max(age)|avg(age)|
+--------+--------+--------+
|      20|      35|    28.2|
+--------+--------+--------+
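
As a cross-check of the figures above, the same aggregates can be recomputed in plain Python from the five input tuples. The sketch calls `builtins.min`, `builtins.max`, and `builtins.sum` explicitly so that it still works in a session where those names have been imported from `pyspark.sql.functions`.

```python
import builtins

data = [("Brooke", 20), ("Brooke", 25), ("Denny", 31), ("Jules", 30), ("TD", 35)]

# Maximum age per name: the plain-Python counterpart of the grouped aggregate.
max_age_by_name = {}
for name, age in data:
    max_age_by_name[name] = builtins.max(age, max_age_by_name.get(name, age))

# Global minimum, maximum, and mean over all ages.
ages = [age for _, age in data]
summary = (builtins.min(ages), builtins.max(ages), builtins.sum(ages) / len(ages))

print(max_age_by_name)
print(summary)
```
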



## Schemas

In [8]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import Row

# Define schema with StructType and StructField
schema = StructType([
    StructField("nom", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=False)
])

# Create data
data = [Row("Brooke", 20)]

# Create DataFrame
df = sql_context.createDataFrame(data, schema)

# Show DataFrame
df.show()

+------+---+
|   nom|age|
+------+---+
|Brooke| 20|
+------+---+



## Files

In [9]:
# Create a sample CSV file
csv_content = """name,age
Brooke,20
Jules,30
Cameron,35"""

with open("sample.csv", "w") as file:
    file.write(csv_content)

print("Sample CSV file created!")

Sample CSV file created!


In [10]:


# Point Spark's S3A connector at the MinIO server
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", f"http://{minio_ip_address}:9000")
hadoop_conf.set("fs.s3a.access.key", "root")
hadoop_conf.set("fs.s3a.secret.key", "password")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")

from minio import Minio

# MinIO client used to create the bucket and upload the file
client_minio = Minio(
    f"{minio_ip_address}:9000",
    access_key="root",
    secret_key="password",
    secure=False
)

# Create the tp6 bucket if it does not exist yet, then upload the sample file
if not client_minio.bucket_exists("tp6"):
    client_minio.make_bucket("tp6")
client_minio.fput_object("tp6", "sample.csv", "sample.csv")

<minio.helpers.ObjectWriteResult at 0x7fea50e0e690>

In [11]:
# Read CSV file
df = (
    sql_context.read
    .format("csv")                  # Specify format as CSV
    .option("inferSchema", "true")  # Infer the schema automatically
    .option("header", "true")       # The file contains a header
    .option("mode", "FAILFAST")     # Fail if there are any errors
    .option("nullValue", "")        # Treat empty strings as null values
    .load("s3a://tp6/sample.csv")                 # Path to the CSV file
)

# Show the DataFrame
df.show()

25/03/04 09:02:53 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

+-------+---+
|   name|age|
+-------+---+
| Brooke| 20|
|  Jules| 30|
|Cameron| 35|
+-------+---+



In [12]:
from pyspark.sql.functions import col

# Example: Add a new column and filter rows
df_transformed = (
    df.withColumn("age_in_5_years", col("age") + 5)  # Add 5 to the age
      .filter(col("age") < 30)                      # Keep rows where age is less than 30
)

# Show the transformed DataFrame
df_transformed.show()

+------+---+--------------+
|  name|age|age_in_5_years|
+------+---+--------------+
|Brooke| 20|            25|
+------+---+--------------+



In [13]:
(df_transformed.coalesce(1).write           # execute on one task
    .format("csv")                          # output format 
    .option("header", "true")               # header present
    .mode("overwrite")                      # overwrite if exist
    .save("s3a://tp6/transformed_sample"))  # directory name

25/03/04 09:03:17 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
                                                                                

You can check in [MinIO](http://localhost:19001/browser/tp6/) (reminder: the user name and password are root / password) that the output was written to the `tp6` bucket, in the `transformed_sample` directory.