## Loading and preparing the data

### Initialize Spark

In [1]:
from modules.spark import spark

In [60]:
from pyspark.sql import functions as F

- Load the data

In [2]:
df = spark.read.csv('../data/raw/DataCoSupplyChainDataset.csv', header=True, inferSchema=True)

### Perform an initial cleaning

In [9]:
df.printSchema()

root
 |-- Type: string (nullable = true)
 |-- Days for shipping (real): integer (nullable = true)
 |-- Days for shipment (scheduled): integer (nullable = true)
 |-- Benefit per order: double (nullable = true)
 |-- Sales per customer: double (nullable = true)
 |-- Delivery Status: string (nullable = true)
 |-- Late_delivery_risk: integer (nullable = true)
 |-- Category Id: integer (nullable = true)
 |-- Category Name: string (nullable = true)
 |-- Customer City: string (nullable = true)
 |-- Customer Country: string (nullable = true)
 |-- Customer Email: string (nullable = true)
 |-- Customer Fname: string (nullable = true)
 |-- Customer Id: integer (nullable = true)
 |-- Customer Lname: string (nullable = true)
 |-- Customer Password: string (nullable = true)
 |-- Customer Segment: string (nullable = true)
 |-- Customer State: string (nullable = true)
 |-- Customer Street: string (nullable = true)
 |-- Customer Zipcode: integer (nullable = true)
 |-- Department Id: integer (nullable = 

- DataFrame dimensions

In [53]:
print("Columns:", len(df.columns))
print("Rows   :", df.count())

Columns: 53
Rows   : 180519


- Show the first 10 rows

In [36]:
df.show(10)

+--------+------------------------+-----------------------------+-----------------+------------------+-----------------+------------------+-----------+--------------+-------------+----------------+--------------+--------------+-----------+--------------+-----------------+----------------+--------------+--------------------+----------------+-------------+---------------+-----------+------------+------------+----------+-------------+-----------------+-----------------------+--------+----------------------+-------------------+------------------------+-------------+------------------------+-----------------------+-------------------+------+----------------+----------------------+--------------+---------------+---------------+-------------+---------------+-------------------+-------------------+--------------------+------------+-------------+--------------+--------------------------+--------------+
|    Type|Days for shipping (real)|Days for shipment (scheduled)|Benefit per order|Sales per 

- Drop the irrelevant columns

In [74]:
df_cleaned = df

delete_cols = [

    # Ids are irrelevant

    "Category Id",
    "Customer Id",
    "Department Id",
    "Order Customer Id",
    "Order Id",
    "Order Item Cardprod Id",
    "Order Item Id",
    "Product Card Id",
    "Product Category Id",

    # Irrelevant names

    "Customer City",
    "Customer Country",
    "Customer Email",
    "Customer Fname",
    "Customer Lname",
    "Customer Password",
    "Customer State",
    "Customer Street",
    "Customer Zipcode",

    # Product details are irrelevant

    "Product Description",
    "Product Image",
    "Product Name",

    # Duplicated columns

    "Benefit per order",
    "Order Item Product Price",
    "Sales per customer",

    # Calculated from other columns

    "Order Item Discount Rate",
    "Order Item Profit Ratio",
    "Sales",
    "Order Item Total",

    # All Values are 0

    "Product Status",

    # Others

    "Latitude",
    "Longitude",

    # Already have state, country and region

    "Order City",
    "Order Zipcode"
]

# Drop all listed columns in a single call
df_cleaned = df_cleaned.drop(*delete_cols)
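
`DataFrame.drop` silently ignores string names that do not match an existing column, so a typo in `delete_cols` would leave a column in place without raising an error. The following is a minimal sketch of a pre-check. The helper name `check_drop_list` and the sample lists are hypothetical and not part of the original notebook. It works on plain lists of names, so in the notebook it could be called as `check_drop_list(df.columns, delete_cols)`.

```python
from collections import Counter


def check_drop_list(existing_cols, cols_to_drop):
    """Return (missing, duplicated) names found in cols_to_drop.

    Hypothetical helper, not part of the original notebook.
    missing:    names that do not appear in existing_cols
    duplicated: names listed more than once in cols_to_drop
    """
    existing = set(existing_cols)
    missing = [c for c in cols_to_drop if c not in existing]
    counts = Counter(cols_to_drop)
    duplicated = [c for c, n in counts.items() if n > 1]
    return missing, duplicated


# Small illustrative lists (not the real dataset columns)
sample_cols = ["Type", "Customer Id", "Order Id", "Sales"]
sample_drop = ["Customer Id", "Order ID", "Sales", "Sales"]

missing, duplicated = check_drop_list(sample_cols, sample_drop)
print("Missing   :", missing)      # names that drop() would silently ignore
print("Duplicated:", duplicated)   # names listed more than once
```

If both lists come back empty, every name in the drop list matches a real column exactly once.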

In [75]:
df_cleaned.printSchema()

root
 |-- Type: string (nullable = true)
 |-- Days for shipping (real): integer (nullable = true)
 |-- Days for shipment (scheduled): integer (nullable = true)
 |-- Delivery Status: string (nullable = true)
 |-- Late_delivery_risk: integer (nullable = true)
 |-- Category Name: string (nullable = true)
 |-- Customer Segment: string (nullable = true)
 |-- Department Name: string (nullable = true)
 |-- Market: string (nullable = true)
 |-- Order Country: string (nullable = true)
 |-- order date (DateOrders): string (nullable = true)
 |-- Order Item Discount: double (nullable = true)
 |-- Order Item Quantity: integer (nullable = true)
 |-- Order Profit Per Order: double (nullable = true)
 |-- Order Region: string (nullable = true)
 |-- Order State: string (nullable = true)
 |-- Order Status: string (nullable = true)
 |-- Product Price: double (nullable = true)
 |-- shipping date (DateOrders): string (nullable = true)
 |-- Shipping Mode: string (nullable = true)



- Show the total number of columns after cleaning

In [76]:
len(df_cleaned.columns)

20
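
The count of 20 is consistent with the 53 original columns minus the 33 names listed in `delete_cols`. As a small sketch, the same check can be written so that it also compares the column names and their order. `remaining_columns` is a hypothetical helper, and the short lists below merely stand in for `df.columns` and `delete_cols`.

```python
def remaining_columns(all_cols, cols_to_drop):
    """Columns left after dropping, in their original order (hypothetical helper)."""
    to_drop = set(cols_to_drop)
    return [c for c in all_cols if c not in to_drop]


# Illustrative stand-ins for df.columns and delete_cols
all_cols = ["Type", "Category Id", "Category Name", "Latitude", "Market"]
cols_to_drop = ["Category Id", "Latitude"]

expected = remaining_columns(all_cols, cols_to_drop)
print(expected)
# The count check only holds when every dropped name exists exactly once
print(len(expected) == len(all_cols) - len(cols_to_drop))
```

In the notebook, the equivalent check would be `assert df_cleaned.columns == remaining_columns(df.columns, delete_cols)`.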

- Save the DataFrame

In [77]:
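# to_csv returns None when given a path, so keep the pandas DataFrame in its own variable
pdf_cleaned = df_cleaned.toPandas()
pdf_cleaned.to_csv("../data/processed/data-with-relevant-columns.csv")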
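
`toPandas()` collects every row on the driver before writing. That is fine at this size, but a possible alternative for larger data is Spark's own CSV writer, which produces a directory of part files rather than a single file. The sketch below assumes a hypothetical helper named `save_as_csv_dir`. It relies only on `DataFrame.write` and the `mode`, `option` and `csv` methods of `DataFrameWriter`.

```python
def save_as_csv_dir(df, out_dir):
    """Write a Spark DataFrame as CSV part files under out_dir.

    Hypothetical helper, not part of the original notebook.
    df is expected to be a pyspark.sql.DataFrame.
    """
    (
        df.write
        .mode("overwrite")        # replace the directory if it already exists
        .option("header", True)   # write column names at the top of each part file
        .csv(out_dir)
    )
    return out_dir
```

A call might look like `save_as_csv_dir(df_cleaned, "../data/processed/data-with-relevant-columns")`, where the output directory name is an assumption. The result can be read back with `spark.read.csv(out_dir, header=True, inferSchema=True)`.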