# Supply chain data analysis
- I will work through the standard data analysis steps to see what this dataset reveals. Everything will be documented, and the final deliverables are a notebook, a report, and a presentation.

1. Importing libraries and loading data

In [12]:
import pandas as pd

# load the data set
df = pd.read_csv('data/supply_chain_data.csv')
df.head()

Unnamed: 0,Product type,SKU,Price,Availability,Number of products sold,Revenue generated,Customer demographics,Stock levels,Lead times,Order quantities,...,Location,Lead time,Production volumes,Manufacturing lead time,Manufacturing costs,Inspection results,Defect rates,Transportation modes,Routes,Costs
0,haircare,SKU0,69.808006,55,802,8661.996792,Non-binary,58,7,96,...,Mumbai,29,215,29,46.279879,Pending,0.22641,Road,Route B,187.752075
1,skincare,SKU1,14.843523,95,736,7460.900065,Female,53,30,37,...,Mumbai,23,517,30,33.616769,Pending,4.854068,Road,Route B,503.065579
2,haircare,SKU2,11.319683,34,8,9577.749626,Unknown,1,10,88,...,Mumbai,12,971,27,30.688019,Pending,4.580593,Air,Route C,141.920282
3,skincare,SKU3,61.163343,68,83,7766.836426,Non-binary,23,13,59,...,Kolkata,24,937,18,35.624741,Fail,4.746649,Rail,Route A,254.776159
4,skincare,SKU4,4.805496,26,871,2686.505152,Non-binary,5,3,56,...,Delhi,5,414,3,92.065161,Fail,3.14558,Air,Route A,923.440632


This shows the first five rows of the dataset.

2. Data Cleaning and Preprocessing

In [13]:
# Checking for duplicate data.
if df.duplicated().any():
    print(f"There are {df.duplicated().sum()} duplicates in our dataset.")
else:
    print("There are no duplicate values in the dataset.")

There are no duplicate values in the dataset.
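
Missing values can be checked in the same way as duplicates. The following is a minimal sketch, not part of the original notebook run: `sample` is a small made-up frame standing in for `df`, and its missing entries are invented purely to show what the check reports.

```python
import pandas as pd

# Hypothetical stand-in for df; the missing entries are made up for illustration.
sample = pd.DataFrame({
    "SKU": ["SKU0", "SKU1", "SKU2"],
    "Price": [69.8, None, 11.3],
    "Location": ["Mumbai", "Mumbai", None],
})

# Count missing values per column, then keep only the columns that have any.
missing_per_column = sample.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]

if columns_with_missing.empty:
    print("There are no missing values in the dataset.")
else:
    print(columns_with_missing)
```

Replacing `sample` with `df` would run the same check on the full dataset.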


3. Data Exploration

In [14]:
df.shape

(100, 24)

In [15]:
df.columns

Index(['Product type', 'SKU', 'Price', 'Availability',
       'Number of products sold', 'Revenue generated', 'Customer demographics',
       'Stock levels', 'Lead times', 'Order quantities', 'Shipping times',
       'Shipping carriers', 'Shipping costs', 'Supplier name', 'Location',
       'Lead time', 'Production volumes', 'Manufacturing lead time',
       'Manufacturing costs', 'Inspection results', 'Defect rates',
       'Transportation modes', 'Routes', 'Costs'],
      dtype='object')

Columns:

1. Product Category Columns.
- Product Type: Specific type of product.
- SKU: Unique Identifier.

2. Key Metrics Columns.
- Price: Price of the Product.
- Availability: Information about product availability.
- Number of Products sold: The number of products sold in a particular time period.
- Revenue Generated: Total Revenue generated by a product in a specific time period.
- Customer Demographics: Customer information; in this dataset it is a gender category (for example Female, Non-binary, Unknown).

3. Supply Chain Details.
- Stock Levels: Number of products available at any given time.
- Lead Times: Time required to order and receive products from the suppliers.
- Order Quantities: The number of products ordered in one order or shipment.
- Shipping Times: Time required to ship the product from the warehouse to the customer.
- Shipping carriers: Company used to ship the products.
- Shipping Costs: Costs associated with shipping products.
- Supplier Name: Name of the supplier who provides products or materials to the company.
- Location: The location associated with the data in the supply chain.
- Lead Time: The time required to obtain products or materials from a supplier.
- Production volumes: The number of products produced in a certain time period.
- Manufacturing Lead time: The time required to produce a product, from scratch.
- Manufacturing costs: Costs related to the production process.

4. Quality Metrics.
- Inspection results: Results of the material quality inspection.
- Defect Rates: The level of defects in the products produced.

5. Transportation Details.
- Transportation modes: Mode of transport used for the product.
- Routes: Routes or paths used to send the products from one point to another in the supply chain.
- Costs: Costs related to various aspects of the supply chain, including transportation costs, production costs and other costs.
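
The grouping above can also be written down as a plain dictionary, which makes it easy to select one group of columns at a time. This is only a sketch: `column_groups` and its keys are names introduced here, while the column names are copied from the `df.columns` output.

```python
# Hypothetical helper: maps each group described above to its column names.
column_groups = {
    "product": ["Product type", "SKU"],
    "key_metrics": [
        "Price", "Availability", "Number of products sold",
        "Revenue generated", "Customer demographics",
    ],
    "supply_chain": [
        "Stock levels", "Lead times", "Order quantities", "Shipping times",
        "Shipping carriers", "Shipping costs", "Supplier name", "Location",
        "Lead time", "Production volumes", "Manufacturing lead time",
        "Manufacturing costs",
    ],
    "quality": ["Inspection results", "Defect rates"],
    "transportation": ["Transportation modes", "Routes", "Costs"],
}

# Flatten the groups to confirm that every column is listed exactly once.
all_columns = [name for names in column_groups.values() for name in names]
print(len(all_columns), len(set(all_columns)))
```

With the dataset loaded, `df[column_groups["quality"]]` would select just the quality columns.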


We will now look at the summary statistics of our dataset.

In [18]:
df.describe()

Unnamed: 0,Price,Availability,Number of products sold,Revenue generated,Stock levels,Lead times,Order quantities,Shipping times,Shipping costs,Lead time,Production volumes,Manufacturing lead time,Manufacturing costs,Defect rates,Costs
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,49.462461,48.4,460.99,5776.048187,47.77,15.96,49.22,5.75,5.548149,17.08,567.84,14.77,47.266693,2.277158,529.245782
std,31.168193,30.743317,303.780074,2732.841744,31.369372,8.785801,26.784429,2.724283,2.651376,8.846251,263.046861,8.91243,28.982841,1.461366,258.301696
min,1.699976,1.0,8.0,1061.618523,0.0,1.0,1.0,1.0,1.013487,1.0,104.0,1.0,1.085069,0.018608,103.916248
25%,19.597823,22.75,184.25,2812.847151,16.75,8.0,26.0,3.75,3.540248,10.0,352.0,7.0,22.983299,1.00965,318.778455
50%,51.239831,43.5,392.5,6006.352023,47.5,17.0,52.0,6.0,5.320534,18.0,568.5,14.0,45.905622,2.141863,520.430444
75%,77.198228,75.0,704.25,8253.976921,73.0,24.0,71.25,8.0,7.601695,25.0,797.0,23.0,68.621026,3.563995,763.078231
max,99.171329,100.0,996.0,9866.465458,100.0,30.0,96.0,10.0,9.929816,30.0,985.0,30.0,99.466109,4.939255,997.41345


The summary statistics show a count of 100 for every numeric column, matching the 100 rows in the dataset, so none of these columns has missing values. The columns also sit on very different scales (for example, Defect rates stays below 5 while Revenue generated reaches nearly 9,900), so some of them may need to be scaled later for modelling.
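
As an illustration of what scaling could look like, here is a minimal sketch using min-max scaling. This is one possible choice, not necessarily the method used later; standardization would be an alternative. The `sample` frame holds the Price and Costs values of the five rows shown in the `df.head()` output.

```python
import pandas as pd

# Price and Costs for the five rows shown by df.head().
sample = pd.DataFrame({
    "Price": [69.808006, 14.843523, 11.319683, 61.163343, 4.805496],
    "Costs": [187.752075, 503.065579, 141.920282, 254.776159, 923.440632],
})

# Min-max scaling: map each column onto the range [0, 1].
scaled = (sample - sample.min()) / (sample.max() - sample.min())
print(scaled.round(3))
```

After scaling, both columns share the same range, so neither dominates a model simply because its raw values are larger.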