# Data preprocessing and cleaning
The goal of this notebook is to clean and preprocess the data before further processing and analysis.

### Instructions:

Use Spark to clean and preprocess the data. Key steps include:
- Handling missing values.
- Removing duplicates.
- Normalizing data formats (e.g., date formats, categorical variables).
- Filtering irrelevant data.

In [27]:
!pip install pyspark
!pip install "pyspark[sql]"
!pip install "pyspark[pandas_on_spark]" plotly
!pip install pandas



In [28]:
import pandas as pd
# import pyspark.pandas as ps
from pyspark.sql import SparkSession


## 1. Data loading

First, we create a Spark session.

In [29]:
# Create a Spark session
spark = SparkSession.builder.appName("BigDataProject").getOrCreate()

In [30]:
# Load CSV data
file_path = "../ecommerce_data_with_trends.csv"
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Save DataFrame as temporary SQL table
data.createOrReplaceTempView("transactions")

# Show first lines
spark.sql("SELECT * FROM transactions LIMIT 5").show()

+--------------------+--------------------+-----------+--------------+-----------------+-------------+--------------------+--------------------+------+--------+------------+
|      transaction_id|           timestamp|customer_id| customer_name|             city|customer_type|        product_name|            category| price|quantity|total_amount|
+--------------------+--------------------+-----------+--------------+-----------------+-------------+--------------------+--------------------+------+--------+------------+
|TX_89a20095-f7be-...|2023-10-30 03:01:...|       6933|    David Hays|      New Sabrina|          B2C|Furniture Product_10|Home & Kitchen > ...|246.08|       4|      984.32|
|TX_a6b15a50-47b9-...|2023-10-30 03:06:...|       9328| Adam Oconnell|East Katherineton|          B2C|Non-Fiction Produ...| Books > Non-Fiction| 792.3|       4|      3169.2|
|TX_abdde2cb-3752-...|2023-10-30 03:06:...|       6766|   Jerry Brown|         Lukefort|          B2B|   Bedding Product_1|Home & 

## 2. Data Preprocessing

In [31]:
# Remove rows with NULL in any of the key columns (other columns may still contain NULLs)
data_cleaned = spark.sql("""
    SELECT *
    FROM transactions
    WHERE transaction_id IS NOT NULL
      AND timestamp IS NOT NULL
      AND customer_id IS NOT NULL
      AND total_amount IS NOT NULL
""")

data_cleaned.createOrReplaceTempView("transactions_cleaned")
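
The filter above spells out each `IS NOT NULL` condition by hand. As a minimal sketch, the same query could be generated from a list of required columns, together with a companion query that counts NULLs per column so the effect of the filter can be checked before rows are dropped. The names `REQUIRED_COLUMNS`, `not_null_filter_query` and `null_count_query` are hypothetical and not part of the original notebook; the helpers only build SQL strings.

```python
# Hypothetical helpers (not from the original notebook): they only build SQL text.
REQUIRED_COLUMNS = ["transaction_id", "timestamp", "customer_id", "total_amount"]


def not_null_filter_query(table, columns):
    """Return a query keeping rows where every listed column is non-NULL."""
    conditions = " AND ".join(f"`{col}` IS NOT NULL" for col in columns)
    return f"SELECT * FROM {table} WHERE {conditions}"


def null_count_query(table, columns):
    """Return a query counting NULLs per listed column.

    COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the NULL count.
    """
    counts = ", ".join(
        f"COUNT(*) - COUNT(`{col}`) AS `{col}_nulls`" for col in columns
    )
    return f"SELECT {counts} FROM {table}"


filter_sql = not_null_filter_query("transactions", REQUIRED_COLUMNS)
count_sql = null_count_query("transactions", REQUIRED_COLUMNS)
print(filter_sql)
print(count_sql)
```

In the notebook, `spark.sql(count_sql).show()` would report the NULL counts, and `spark.sql(filter_sql)` should give the same result as the hand-written query, assuming the `spark` session and the `transactions` view defined earlier.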


In [32]:
# Remove exact duplicate rows (DISTINCT compares every column, not only transaction_id)
data_cleaned_no_duplicates = spark.sql("""
    SELECT DISTINCT *
    FROM transactions_cleaned
""")

data_cleaned_no_duplicates.createOrReplaceTempView("transactions_no_duplicates")
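
`SELECT DISTINCT *` removes only rows that are identical in every column, so two rows sharing a `transaction_id` but differing elsewhere would both survive. If one row per transaction is wanted, a window function is one option. The sketch below builds such a query; keeping the earliest `timestamp` per `transaction_id` is an assumption (the notebook does not say which row should win), and `dedup_by_key_query` and the `row_num` column are hypothetical names.

```python
def dedup_by_key_query(table, key, order_by):
    """Return a query keeping one row per `key`: the first when sorted by `order_by`."""
    return (
        "SELECT * FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY `{key}` ORDER BY `{order_by}`) AS row_num "
        f"FROM {table}"
        ") ranked WHERE row_num = 1"
    )


# Assumption: the earliest row per transaction_id is the one to keep.
dedup_sql = dedup_by_key_query("transactions_cleaned", "transaction_id", "timestamp")
print(dedup_sql)
```

In the notebook this could be run as `spark.sql(dedup_sql).drop("row_num")`, which removes the helper column again before the result is registered as a view.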


In [33]:
# Filter out irrelevant transactions (keep only positive total amounts)
data_filtered = spark.sql("""
    SELECT *
    FROM transactions_no_duplicates
    WHERE total_amount > 0
""")

data_filtered.createOrReplaceTempView("transactions_filtered")


## 3. Data Cleaning

In [34]:
# Timestamp format normalization
# Convert timestamp column to format yyyy-MM-dd HH:mm:ss
data_normalized = spark.sql("""
    SELECT *,
           CAST(timestamp AS TIMESTAMP) AS normalized_timestamp
    FROM transactions_filtered
""")

data_normalized.createOrReplaceTempView("transactions_normalized")
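
`CAST(timestamp AS TIMESTAMP)` changes the column type but does not by itself pin down a textual format. If a string column in the exact `yyyy-MM-dd HH:mm:ss` pattern is wanted, Spark SQL's `date_format` function can be applied on top of the cast. The sketch below builds such a query; `formatted_timestamp_query`, `TIMESTAMP_PATTERN` and the output column `formatted_timestamp` are hypothetical names.

```python
# Spark datetime pattern named in the cell comment above.
TIMESTAMP_PATTERN = "yyyy-MM-dd HH:mm:ss"


def formatted_timestamp_query(table, column="timestamp", pattern=TIMESTAMP_PATTERN):
    """Return a query adding a string column with the timestamp in `pattern`."""
    return (
        f"SELECT *, date_format(CAST(`{column}` AS TIMESTAMP), '{pattern}') "
        f"AS `formatted_{column}` FROM {table}"
    )


ts_sql = formatted_timestamp_query("transactions_filtered")
print(ts_sql)
```

Passing `ts_sql` to `spark.sql(...)` would add the formatted column alongside the original one, assuming the `transactions_filtered` view from the previous section.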


In [35]:
# Normalize customer types (uppercase, leading/trailing spaces trimmed)
data_categorized = spark.sql("""
    SELECT *,
           UPPER(TRIM(customer_type)) AS normalized_customer_type
    FROM transactions_normalized
""")

data_categorized.createOrReplaceTempView("transactions_categorized")


## 4. Export processed and cleaned data

In [36]:
output_path = "../preprocessed_data"
# Export the final DataFrame, which includes the normalized columns from section 3
data_categorized.write.csv(output_path, header=True)

print(f"Preprocessed and cleaned data saved at {output_path}")


Preprocessed and cleaned data saved at ../preprocessed_data
