# Assignments

## Assignment 1

### DataFrame Formatting

In [1]:
# Set up shell for libraries, packages, and data

import pandas as pd
import numpy as np

metadata = pd.read_csv('../data/meta_data.csv')
df = pd.read_csv('../data/transaction_data.csv')

In [None]:
metadata


In [None]:
df

In [4]:
# Dataframe Formatting ~ Changing column names to follow the same naming convention

# Renaming columns for "transaction" dataframe

df.rename(columns={'sku_numb3r': 'sku_number','leadTime': 'lead_time','Division Code':'division_code','Order-Quantity':'order_quantity'}, inplace=True)

# view df column names as list:

print(df.columns.tolist())

['sku_number', 'inventory_type', 'stocking_type', 'lead_time', 'unit_price', 'manufacturing_site', 'division_code', 'transaction_date', 'order_quantity']
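The rename above uses an explicit mapping. As a rough alternative, the naming convention could be applied with a small helper. The sketch below is an illustration only: `to_snake_case` and the `toy` frame are hypothetical names, and a pattern-based helper cannot repair typos such as `sku_numb3r`, which still need an explicit mapping.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a label such as 'leadTime' or 'Order-Quantity' to snake_case."""
    # Insert an underscore between a lowercase letter/digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())
    # Replace runs of spaces or hyphens with a single underscore
    name = re.sub(r'[\s\-]+', '_', name)
    return name.lower()


# Hypothetical empty frame that reuses the original column labels
toy = pd.DataFrame(columns=['sku_numb3r', 'leadTime', 'Division Code', 'Order-Quantity'])
toy = toy.rename(columns=to_snake_case)

# Typos cannot be fixed by a pattern, so they still need an explicit mapping
toy = toy.rename(columns={'sku_numb3r': 'sku_number'})
print(toy.columns.tolist())
```

The explicit dictionary used in the notebook is easier to audit for a handful of columns; a helper like this mainly pays off when many columns need the same treatment.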


### Data Inspection

In [None]:
# inspect the numeric features of the transaction dataframe
df.describe()

**Data Quality Issues (Oddities & Anomalies):**

* Missing values: `manufacturing_site`, `division_code`, and `stocking_type` contain several missing values (NaN and "NA").

* Outliers in `unit_price`: the column contains at least one negative value. Most values in the data set average around $800, but at least one observation has a unit price of -$994.

* Outliers in `order_quantity`: most quantities are in the double digits, but at least one observation has a low order quantity (3).
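The issues listed above can be counted directly rather than read off `describe()`. The following is a minimal sketch on hypothetical toy rows (not the assignment data); the values are invented only to mimic the kinds of problems described.

```python
import pandas as pd

# Hypothetical toy rows that mimic the issues described above
toy = pd.DataFrame({
    'manufacturing_site': ['A', None, 'B', 'C'],
    'division_code': ['D1', 'D2', None, 'D4'],
    'unit_price': [995.0, -994.0, 1001.5, 988.0],
    'order_quantity': [80, 3, -119, 75],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Count negative values in each numeric column of interest
numeric_cols = ['unit_price', 'order_quantity']
negative_counts = (toy[numeric_cols] < 0).sum()

print(missing_counts)
print(negative_counts)
```

Running the same two checks on `df` would show how many rows each issue actually affects before deciding whether to drop or correct them.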


**Changes and Recommendations:**

* Remove NaN values and "NA" strings to preserve data integrity.
* Filter `unit_price` to remove negative values, and filter `order_quantity` to keep only observations with a value of 10 or higher, so that outliers do not skew the results.



**Summary of Data Inspection:**

The summary statistics reveal a few important patterns and some clear anomalies. For example, `lead_time` appears highly consistent, with the 25th, 50th, and 75th percentiles all at 28 days, which suggests a standardized process. Furthermore, `unit_price` is tightly clustered between approximately $981 and $1004, which indicates price stability across most of the records. However, the minimum value of -$994 is possibly invalid and may indicate a misrecorded transaction or a return/refund. Additionally, `order_quantity` shows a reasonable distribution centered around 80 units, but the minimum of -119 suggests a data issue, possibly related to returns or data entry errors. Lastly, the data needs to be inspected to ensure there are no duplicate entries. Overall, the data appears generally well-structured, but the extreme outliers, missing values, and negative values should be flagged and addressed before proceeding with deeper analysis.
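The summary notes that duplicates still need to be checked. One possible way to do that is sketched below on hypothetical toy rows; it treats a row as a duplicate only when every column matches an earlier row, which is an assumption about what counts as a duplicate here.

```python
import pandas as pd

# Hypothetical toy rows: the second row exactly repeats the first
toy = pd.DataFrame({
    'sku_number': ['A1', 'A1', 'B2', 'B2'],
    'transaction_date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-03'],
    'order_quantity': [80, 80, 75, 75],
})

# Flag rows that exactly repeat an earlier row (all columns equal)
duplicate_mask = toy.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Keep only the first occurrence of each fully identical row
deduplicated = toy.drop_duplicates()

print(f"Duplicate rows found: {n_duplicates}")
```

Passing `subset=[...]` to `duplicated()` would narrow the comparison to selected columns if identical transactions on the same date are considered legitimate.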


### Handling `NaN` Values

In [None]:
# Drop rows where sku_number is missing (NaN)

df = df.dropna(subset=['sku_number'])

# Drop rows where the sku_number column contains the literal string "NA"
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df[df['sku_number'] != 'NA'].copy()

# Handle negative unit_price values by converting them to absolute values,
# then filter out rows with a negative order_quantity

df['unit_price'] = df['unit_price'].abs()
df = df[df['order_quantity'] >= 0].copy()

df

`NaN` Handling Notes:

- Removed incomplete records by dropping rows with missing (NaN) values or placeholder "NA" strings in the `sku_number` column to ensure data integrity, reliability, and accuracy for categorization and future analyses.

- Converted negative `unit_price` values to their absolute values and removed rows with a negative `order_quantity` to avoid misleading results.
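One detail worth checking: by default `pd.read_csv` parses the literal string `NA` as a missing value, so placeholder "NA" strings may already be NaN by the time `dropna` runs. The sketch below uses hypothetical inline CSV text (not the assignment files) to show the difference between the default behavior and `keep_default_na=False`.

```python
import io

import pandas as pd

# Hypothetical CSV text with a literal "NA" and an empty sku_number field
csv_text = "sku_number,order_quantity\nA1,80\nNA,75\n,60\n"

# Default parsing: both the literal "NA" and the empty field become NaN
default_df = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default_df['sku_number'].isna().sum())

# With keep_default_na=False (and no na_values), nothing is converted to NaN
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
n_missing_raw = int(raw_df['sku_number'].isna().sum())

# After default parsing, a single dropna removes both kinds of placeholder
cleaned = default_df.dropna(subset=['sku_number'])

print(n_missing_default, n_missing_raw, len(cleaned))
```

If the files are loaded with the defaults, the separate string filter for "NA" is harmless but may not remove any additional rows.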



### Useful Information

In [17]:
# Determine unique sku_number count

sku_count = df['sku_number'].nunique()
print(f"Number of unique values in 'sku_number' column: {sku_count}")

Number of unique values in 'sku_number' column: 473


In [9]:
# Determine unique manufacturing_site count

manufacturing_count = df['manufacturing_site'].nunique()
print(f"Number of unique values in 'manufacturing_site' column: {manufacturing_count}")

Number of unique values in 'manufacturing_site' column: 15


In [10]:
# Determine division_code count

division_count = df['division_code'].nunique()
print(f"Number of unique values in 'division_code' column: {division_count}")

Number of unique values in 'division_code' column: 66


In [None]:
# Top 10 transactions by order_quantity

top_orders = df.nlargest(10, 'order_quantity')
top_orders

In [None]:
# Bottom 10 transactions by order_quantity

bottom_orders = df.nsmallest(10, 'order_quantity')
bottom_orders

In [None]:
# Top 10 transactions by total_sales_value

# create total_sales_value column and calculate transactions

df['total_sales_value'] = (df['unit_price'] * df['order_quantity'])

# Calculate and display top 10 sales transactions

top_sales = df.sort_values(by='total_sales_value', ascending=False).head(10)
top_sales.style.format({'total_sales_value': '${:,.2f}'})

In [None]:
# Bottom 10 transactions by total_sales_value

bottom_sales = df.sort_values(by='total_sales_value', ascending=True).head(10).round(2)
bottom_sales.style.format({'total_sales_value': '${:,.2f}'})

### References



1. GeeksforGeeks. “Working with Missing Data in Pandas.” GeeksforGeeks, 28 July 2025, https://www.geeksforgeeks.org/data-analysis/working-with-missing-data-in-pandas/.

2. GeeksforGeeks. “Pandas DataFrame.dropna() Method.” GeeksforGeeks, 25 June 2025, https://www.geeksforgeeks.org/python/python-pandas-dataframe-dropna/.

3. GeeksforGeeks. “Find Duplicate Rows in a Dataframe Based on All or Selected Columns.” GeeksforGeeks, 4 Dec. 2023, https://www.geeksforgeeks.org/python/find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns/.

4. GeeksforGeeks. “Filter Pandas Dataframe with Multiple Conditions.” GeeksforGeeks, 23 July 2025, https://www.geeksforgeeks.org/python/filter-pandas-dataframe-with-multiple-conditions/.

5. GeeksforGeeks. “Adding New Column to Existing DataFrame in Pandas.” GeeksforGeeks, 11 July 2025, https://www.geeksforgeeks.org/pandas/adding-new-column-to-existing-dataframe-in-pandas/.

6. GeeksforGeeks. “Get n-Smallest Values from a Particular Column in Pandas DataFrame.” GeeksforGeeks, 11 July 2025, https://www.geeksforgeeks.org/python/get-n-smallest-values-from-a-particular-column-in-pandas-dataframe/.

7. Pandas Development Team. “Working with Missing Data.” Pandas 2.3.3 Documentation, pandas.pydata.org, https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html.

8. Stack Overflow. “Renaming column names in Pandas.” Stack Overflow, 2012, https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas. Accessed 1 Oct. 2025.

9. Stack Overflow. “How to show all columns' names on a large pandas dataframe?” Stack Overflow, 2012, https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas. Accessed 30 Sept. 2025.




## Assignment 2

In [18]:
# Set up shell for data and packages

import pandas as pd
from scipy.stats import norm
import numpy as np

# Load dataset
metadata = pd.read_csv('../data/meta_data.csv')
df = pd.read_csv('../data/transaction_data.csv')

In [None]:
# Dataframe Formatting 

# Renaming columns for "transaction" dataframe

df.rename(columns={'sku_numb3r': 'sku_number','leadTime': 'lead_time','Division Code':'division_code','Order-Quantity':'order_quantity'}, inplace=True)

# Drop rows where sku_number is missing (NaN)
# (.copy() avoids a SettingWithCopyWarning on the assignment below)

df = df.dropna(subset=['sku_number']).copy()

# Handle negative unit_price values by converting them to absolute values,
# then filter out rows with a negative order_quantity

df['unit_price'] = df['unit_price'].abs()
df = df[df['order_quantity'] >= 0]

df

### Required Transformation

In [None]:
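# Filter for the appropriate SKUs (inventory_type 'FG' and stocking_type 'MTS')

filtered_df = df[(df['inventory_type'] == 'FG') & (df['stocking_type'] == 'MTS')]

# Group the filtered transactions by SKU
grouped_df = filtered_df.groupby('sku_number')

# Note: on a GroupBy object, head() returns the first 5 rows of every group
grouped_df.head()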



In [None]:

# Aggregate order quantity stats per SKU group

aggregate_df = grouped_df.aggregate({'order_quantity': ['min', 'max', 'mean', 'median', 'var', 'std'],'lead_time': 'mean'})
aggregate_df.head()
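The dictionary form of `aggregate` above produces two-level (MultiIndex) column names such as `('order_quantity', 'std')`. If flat column names are preferred, pandas named aggregation is one option. The sketch below uses hypothetical toy rows and hypothetical output column names.

```python
import pandas as pd

# Hypothetical toy transactions for two SKUs
toy = pd.DataFrame({
    'sku_number': ['A1', 'A1', 'A1', 'B2', 'B2'],
    'order_quantity': [80, 90, 100, 10, 30],
    'lead_time': [28, 28, 28, 14, 14],
})

# Named aggregation: output_column=(input_column, function)
flat_stats = toy.groupby('sku_number').agg(
    order_quantity_mean=('order_quantity', 'mean'),
    order_quantity_std=('order_quantity', 'std'),
    lead_time_mean=('lead_time', 'mean'),
)

print(flat_stats)
```

With flat names, later steps can refer to `flat_stats['order_quantity_std']` instead of indexing with a tuple.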


### Safety Stock Calculation

In [None]:
# Calculate the safety stock for each SKU at the 75%, 90%, and 95% service levels
# Safety stock = z-score * std of order quantity * sqrt(mean lead time)

# z-score for a 75% service level
a75 = 0.75
alpha_75 = norm.ppf(a75)

aggregate_df['safety_stock_75'] = (
    alpha_75 * aggregate_df[('order_quantity', 'std')] * np.sqrt(aggregate_df[('lead_time', 'mean')])
)

# z-score for a 90% service level
a90 = 0.90
alpha_90 = norm.ppf(a90)

aggregate_df['safety_stock_90'] = (
    alpha_90 * aggregate_df[('order_quantity', 'std')] * np.sqrt(aggregate_df[('lead_time', 'mean')])
)

# z-score for a 95% service level
a95 = 0.95
alpha_95 = norm.ppf(a95)

aggregate_df['safety_stock_95'] = (
    alpha_95 * aggregate_df[('order_quantity', 'std')] * np.sqrt(aggregate_df[('lead_time', 'mean')])
)

# Round every column up to the next whole number
rounded_aggregate_df = aggregate_df.apply(np.ceil).astype(int)
rounded_aggregate_df



NOTES: Added `.apply(np.ceil).astype(int)` to convert the safety stock calculations to whole numbers, because you can't hold partial units of inventory. `np.ceil` always rounds up to the next whole number, so each rounded value slightly overstates the computed safety stock. Because it is applied to the whole dataframe, the summary statistics (mean, variance, and so on) are rounded up as well.
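For reference, the calculation and rounding can be sketched end to end on toy numbers. This is an illustration under assumptions, not the notebook's exact code: the `stats` frame, its column names (`demand_std`, `lead_time_mean`), and the values are hypothetical. It rounds only the safety stock columns, and it uses the nullable `Int64` dtype so that an SKU with an undefined standard deviation (for example, a single transaction) does not break the integer conversion.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical per-SKU statistics; C3 has no standard deviation available
stats = pd.DataFrame(
    {'demand_std': [10.0, 2.5, np.nan], 'lead_time_mean': [28.0, 14.0, 28.0]},
    index=pd.Index(['A1', 'B2', 'C3'], name='sku_number'),
)

# Service levels keyed by the suffix used in the output column name
service_levels = {'75': 0.75, '90': 0.90, '95': 0.95}

for label, level in service_levels.items():
    z = norm.ppf(level)  # z-score for the service level
    stats[f'safety_stock_{label}'] = (
        z * stats['demand_std'] * np.sqrt(stats['lead_time_mean'])
    )

# Round only the safety stock columns; np.ceil always rounds up
safety_cols = [c for c in stats.columns if c.startswith('safety_stock_')]
rounded = np.ceil(stats[safety_cols]).astype('Int64')

print(rounded)
```

Rounding up is the conservative choice for safety stock, since rounding down would leave the stock just short of the target service level.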

### Safety Stock Distribution

In [None]:
# Determine which SKU has the largest safety stock at 95% service level
max_safety_stock_95 = aggregate_df['safety_stock_95'].idxmax()
print("SKU with the largest safety stock at the 95th percentile:", max_safety_stock_95)


# Determine which SKU has the smallest safety stock at 95% service level
min_safety_stock_95 = aggregate_df['safety_stock_95'].idxmin()
print("SKU with the smallest safety stock at the 95th percentile:", min_safety_stock_95)

# Calculate average safety stock across all SKUs at 95% service level
average_safety_stock_95 = aggregate_df['safety_stock_95'].mean().round().astype(int)
print("Average safety stock at the 95th percentile:", average_safety_stock_95)


### References

* Bobbitt, Zach. “Pandas: How to Use dropna() with Specific Columns.” Statology, 13 Feb. 2023, https://www.statology.org/pandas-dropna-specific-column/.


* Kumar, Bijay. “Scipy Stats Zscore: Calculate and Use Z-Score.” *Python Guides*, 20 June 2025, https://pythonguides.com/scipy-stats-zscore/. Accessed 21 Oct. 2025.

* NumPy Developers. “numpy.round — NumPy v2.3 Manual.” *NumPy*, https://numpy.org/doc/stable/reference/generated/numpy.round.html. Accessed 21 Oct. 2025.

* “pandas.DataFrame.aggregate — pandas 2.3.3 Documentation.” *pandas*, PyData, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html.





