# Initial Data Exploration 


In [1]:
import pandas as pd

## receivals.csv 

This section explores the `receivals.csv` dataset to understand its structure and content.
The goal is to understand what each row represents and how the columns relate to one another: whether each row is unique, and how purchase orders are linked to suppliers, raw materials, products, and items.

In [11]:
receivals = pd.read_csv("../../data/kernel/receivals.csv")
#receivals.head()
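
Before counting anything, it can help to check the two basic properties mentioned above: whether rows are unique and whether key columns contain missing values. The sketch below is a minimal illustration on a small, made-up frame (`toy` and its values are hypothetical, not taken from `receivals.csv`); the same calls can be applied to `receivals` directly. The float-formatted `rm_id` in the last output of this section (`3362.0`) hints that the column may contain missing values, since pandas typically stores an integer column with gaps as float, but that is an assumption worth verifying rather than a confirmed fact.

```python
import pandas as pd

# Toy stand-in for receivals; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "purchase_order_id": [1, 1, 2, 2, 3],
        "product_id": [10, 10, 11, 11, 12],
        "rm_id": [100.0, 100.0, 101.0, None, 102.0],
    }
)

# Rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

print(f"Fully duplicated rows: {n_duplicate_rows}")
print(missing_per_column)
```

Note that `groupby` drops missing keys by default and `nunique` ignores missing values, so rows with a missing `rm_id` do not contribute to the material counts computed in the cells that follow.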

### Number of unique orders, materials and products  

In [None]:

unique_orders_count = receivals["purchase_order_id"].nunique()
unique_products_count = receivals["product_id"].nunique()
unique_materials_count = receivals["rm_id"].nunique()
print(f"Number of unique purchase orders: {unique_orders_count}")
print(f"Number of unique products: {unique_products_count}")
print(f"Number of unique raw materials: {unique_materials_count}")

Number of unique purchase orders: 7173
Number of unique products: 54
Number of unique raw materials: 203


### Number of materials per order

In [30]:
# Count the distinct raw materials (rm_id) received under each purchase order.
rm_per_order = (
    receivals.groupby("purchase_order_id")["rm_id"]
    .nunique()
    .reset_index(name="unique_rm_count")
)

#print(rm_per_order.head())
# Tabulate how many orders have 1, 2, 3, ... distinct materials.
distribution = rm_per_order["unique_rm_count"].value_counts().sort_index()
print("\nDistribution of number of unique materials per order:")
print(distribution)


Distribution of number of unique materials per order:
unique_rm_count
1     5026
2     1082
3      305
4      231
5      161
6      106
7       73
8       62
9       45
10      38
11      22
12      10
13       3
14       4
15       1
16       1
22       1
24       2
Name: count, dtype: int64
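
Raw counts are easier to interpret as shares. The sketch below rebuilds the distribution from the numbers printed above and converts it to relative and cumulative shares; the names `counts`, `shares`, and `summary` are introduced here for illustration only. On this output, about 70% of the 7173 orders contain a single raw material.

```python
import pandas as pd

# Counts copied from the printed distribution above
# (index = unique materials per order, value = number of orders).
counts = pd.Series(
    {1: 5026, 2: 1082, 3: 305, 4: 231, 5: 161, 6: 106, 7: 73, 8: 62, 9: 45,
     10: 38, 11: 22, 12: 10, 13: 3, 14: 4, 15: 1, 16: 1, 22: 1, 24: 2},
    name="order_count",
)

# Share of orders in each bucket, and the running total of those shares.
shares = counts / counts.sum()
cumulative_shares = shares.cumsum()

summary = pd.DataFrame(
    {"share": shares.round(4), "cumulative_share": cumulative_shares.round(4)}
)
print(summary.head())
```

On the real frame, `rm_per_order["unique_rm_count"].value_counts(normalize=True).sort_index()` should give the same shares without retyping the counts.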


### Number of products per order

In [25]:
products_per_order = (
    receivals.groupby("purchase_order_id")["product_id"]
    .nunique()
    .reset_index(name="unique_product_count")
)

distribution = products_per_order["unique_product_count"].value_counts().sort_index()
print("\nDistribution of number of unique products per order:")
print(distribution)


Distribution of number of unique products per order:
unique_product_count
1    5727
2     696
3     313
4     196
5     175
6      60
7       5
8       1
Name: count, dtype: int64


### Number of materials per product

In [29]:
rm_per_product = (
    receivals.groupby("product_id")["rm_id"]
    .nunique()
    .reset_index(name="unique_rm_count")
)

multi_rm_products = rm_per_product[rm_per_product["unique_rm_count"] > 1]

print(f"\nNumber of products with multiple raw materials: {len(multi_rm_products)}")
#print(multi_rm_products.head()) 

distribution = rm_per_product["unique_rm_count"].value_counts().sort_index()
print("\nDistribution of number of materials per product:")
print(distribution)


Number of products with multiple raw materials: 37

Distribution of number of materials per product:
unique_rm_count
1     17
2     15
3      9
4      1
5      1
7      2
8      2
9      2
11     2
15     1
16     1
20     1
Name: count, dtype: int64


### Number of products per material 

In [31]:
products_per_material = (
    receivals.groupby("rm_id")["product_id"]
    .nunique()
    .reset_index(name="unique_product_count")
)

multi_product_materials = products_per_material[products_per_material["unique_product_count"] > 1]

print(f"Number of materials used in multiple products: {len(multi_product_materials)}")
print(multi_product_materials.head())  

distribution = products_per_material["unique_product_count"].value_counts().sort_index()
print("\nDistribution of products per material:")
print(distribution)

Number of materials used in multiple products: 1
      rm_id  unique_product_count
161  3362.0                     2

Distribution of products per material:
unique_product_count
1    202
2      1
Name: count, dtype: int64
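
Since all but one material map to exactly one product, `rm_id` almost determines `product_id`. A natural follow-up is to look at the exception and see how many rows fall under each of its products. The sketch below shows one way to do that on a small, made-up frame (`toy` and its values are hypothetical); replacing `toy` with `receivals` would apply the same logic to the real data.

```python
import pandas as pd

# Toy stand-in for receivals; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "rm_id": [100, 100, 101, 102, 102],
        "product_id": [10, 10, 11, 12, 13],
    }
)

# Materials that appear under more than one product.
product_counts = toy.groupby("rm_id")["product_id"].nunique()
ambiguous_rm_ids = product_counts[product_counts > 1].index

# For those materials, count the rows per (material, product) pair.
ambiguous_pairs = (
    toy[toy["rm_id"].isin(ambiguous_rm_ids)]
    .groupby(["rm_id", "product_id"])
    .size()
    .reset_index(name="row_count")
)
print(ambiguous_pairs)
```

If one of the pairs covers only a handful of rows, the second product may be a data-entry inconsistency rather than a genuine second use of the material; that interpretation is a guess to be checked against the data.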
