## Exploratory Data Analysis (EDA)

#### Generate global descriptive statistics with .describe().show().

In [3]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Bank Customer Churn Prediction")
    .getOrCreate()
)


In [4]:
df = spark.read.csv("../data/raw/dataset.csv", header=True, inferSchema=True)

In [6]:
df.describe().show()

+-------+------------------+-----------------+-------+-----------------+---------+------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|summary|         RowNumber|       CustomerId|Surname|      CreditScore|Geography|Gender|               Age|            Tenure|          Balance|     NumOfProducts|          HasCrCard|     IsActiveMember|  EstimatedSalary|             Exited|
+-------+------------------+-----------------+-------+-----------------+---------+------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|  count|             10000|            10000|  10000|            10000|    10000| 10000|             10000|             10000|            10000|             10000|              10000|              10000|            10000|              10000|
|   mean|            5000.5|
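Note that describe() also summarises the string columns (Surname, Geography, Gender), for which mean and stddev are reported as null. A minimal sketch of restricting the summary to numeric columns follows. The helper `numeric_columns` and the `assumed_dtypes` list are hypothetical: the list is a guess at what `df.dtypes` would return after `inferSchema=True`, since the notebook does not print the schema.

```python
# Base type names that Spark reports for numeric columns in df.dtypes.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}

def numeric_columns(dtypes):
    """Return the names of columns whose Spark type string is numeric."""
    # split("(") drops precision/scale, e.g. "decimal(10,2)" -> "decimal"
    return [name for name, dtype in dtypes if dtype.split("(")[0] in NUMERIC_TYPES]

# ASSUMPTION: plausible df.dtypes for this dataset (not shown in the notebook).
assumed_dtypes = [
    ("RowNumber", "int"), ("CustomerId", "int"), ("Surname", "string"),
    ("CreditScore", "int"), ("Geography", "string"), ("Gender", "string"),
    ("Age", "int"), ("Tenure", "int"), ("Balance", "double"),
    ("NumOfProducts", "int"), ("HasCrCard", "int"), ("IsActiveMember", "int"),
    ("EstimatedSalary", "double"), ("Exited", "int"),
]

numeric_cols = numeric_columns(assumed_dtypes)
print(numeric_cols)

# With a live session, the same helper would be applied as:
# df.select(numeric_columns(df.dtypes)).describe().show()
```

Filtering on `df.dtypes` keeps the column list in sync with the inferred schema instead of hard-coding names.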

#### Identify missing values per column via df.select.

In [42]:
from pyspark.sql import functions as F

cols = [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]

df_with_nulls = df.select(cols)
df_with_nulls.show()

+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+
|RowNumber|CustomerId|Surname|CreditScore|Geography|Gender|Age|Tenure|Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+
|        0|         0|      0|          0|        0|     0|  0|     0|      0|            0|        0|             0|              0|     0|
+---------+----------+-------+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+



Every column reports 0 nulls, so the DataFrame contains no missing values.
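The isNull() check counts SQL NULLs only; it does not flag NaN floats or whitespace-only strings. As a rough illustration of a broader definition of "missing", here is a plain-Python sketch on made-up rows. The rows and the helpers `is_missing` and `count_missing` are hypothetical and not part of the notebook.

```python
import math

def is_missing(value):
    """Treat None, NaN floats and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def count_missing(rows):
    """Count missing values per column over a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts

# Hypothetical rows, invented for illustration only.
rows = [
    {"Surname": "Smith", "Balance": 1200.5},
    {"Surname": "   ", "Balance": float("nan")},
    {"Surname": None, "Balance": 0.0},
]
print(count_missing(rows))  # {'Surname': 2, 'Balance': 1}
```

In PySpark, a comparable check could presumably be built by combining isNull() with `F.isnan` on floating-point columns.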

#### Detect outliers using approximate quantiles (approxQuantile) or the IQR method.

- approxQuantile

In [62]:

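# Numeric columns to screen; CustomerId is an identifier, so outliers there carry no business meaning.
for col_name in ['CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']:
    # Approximate first and third quartiles with a relative error of 0.01
    Q1, Q3 = df.approxQuantile(col_name, [0.25, 0.75], 0.01)
    IQR = Q3 - Q1

    # IQR fences: 1.5 * IQR beyond each quartile
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count the rows that fall outside the fences
    outliers = df.filter((df[col_name] < lower_bound) | (df[col_name] > upper_bound))
    print(f"{col_name:20s} - {outliers.count()}")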



CustomerId           - 0
CreditScore          - 17
Age                  - 359
Tenure               - 0
Balance              - 0
NumOfProducts        - 60
EstimatedSalary      - 0
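
To make the rule applied in the loop above concrete, here is a minimal pure-Python sketch on a made-up list of ages. The values are hypothetical, not taken from the dataset, and the helper names are illustrative. It uses a nearest-rank quantile for simplicity, which is an assumption: approxQuantile may return slightly different quartiles within its relative-error tolerance.

```python
import math

def nearest_rank_quantile(sorted_values, q):
    """Return the nearest-rank quantile of an already sorted list (0 < q <= 1)."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]

def iqr_bounds(values, k=1.5):
    """Compute the lower and upper IQR fences for a list of numbers."""
    ordered = sorted(values)
    q1 = nearest_rank_quantile(ordered, 0.25)
    q3 = nearest_rank_quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values lying strictly outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical ages, invented for illustration only.
sample_ages = [25, 31, 33, 35, 37, 38, 40, 42, 44, 47, 51, 92]
lower, upper = iqr_bounds(sample_ages)
outliers = find_outliers(sample_ages)
print(f"bounds = ({lower}, {upper}), outliers = {outliers}")
# bounds = (16.5, 60.5), outliers = [92]
```

Here Q1 = 33 and Q3 = 44, so IQR = 11 and the fences sit 16.5 below Q1 and 16.5 above Q3; only the value 92 falls outside them.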
