Demand Forecasting: Developing models that accurately predict future demand for each product. This involves analyzing historical sales data, market trends, seasonality, and external factors such as economic indicators or events that could influence demand. Accurate forecasts are crucial for planning production and inventory levels and for avoiding stockouts or excess inventory.

Supply Chain Optimization: Identifying the most efficient and cost-effective strategies for sourcing materials, manufacturing, and distributing products to retailers or directly to consumers. This might involve route optimization, supplier selection, and evaluating make-or-buy decisions.

Inventory Management: Developing algorithms to optimize inventory levels across different warehouses and retail outlets to ensure that products are available where and when they're needed, minimizing holding costs and reducing the risk of stockouts or overstock situations.

Product Lifecycle Management: Analyzing sales and customer feedback data to determine the lifecycle stage of each product, helping the company decide when to introduce new models, discontinue products, or run promotions to clear out inventory.

Cost Reduction and Efficiency Improvement: Identifying opportunities to reduce costs and improve operational efficiency across the supply chain. This could involve automating manual processes, improving the accuracy of demand planning to reduce the need for expedited shipping, and optimizing the product mix to maximize profitability.

Risk Management: Assessing and mitigating risks related to supply chain disruption, such as supplier reliability, geopolitical factors, natural disasters, and pandemics. Developing contingency plans and strategies to ensure supply chain resilience.

Sustainability Analysis: Evaluating the environmental impact of supply chain operations and identifying opportunities to reduce the carbon footprint, cut waste, and improve sustainability practices in production and distribution.

Customer Experience Enhancement: Analyzing customer feedback and return data to identify issues with product quality or features that could be improved. Working with product development and quality assurance teams to address these issues can enhance customer satisfaction and loyalty.

Market Trend Analysis: Keeping abreast of market trends and technological advancements in the home appliance sector to forecast shifts in consumer preferences and emerging opportunities or threats.

Supplier Performance Management: Developing metrics and monitoring systems to assess supplier performance in terms of quality, delivery, and cost. This information can be used to negotiate better terms, identify areas for improvement, or make decisions about changing suppliers.

1- Demand Forecasting
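Before building any model, it helps to have a simple reference forecast to compare against. One common choice is a seasonal-naive baseline, which predicts each future period with the value observed one season earlier. The sketch below is only an illustration: the monthly totals are made up and the helper name `seasonal_naive_forecast` is hypothetical, not part of the dataset built in this notebook.

```python
import numpy as np
import pandas as pd

def seasonal_naive_forecast(series, season_length=12, horizon=3):
    """Forecast the next `horizon` points by repeating the last full season."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = series.iloc[-season_length:].to_numpy()
    # Tile the last season until it covers the horizon, then cut to length
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

# Hypothetical monthly unit sales for two years (illustrative numbers only)
monthly_sales = pd.Series(
    [120, 110, 130, 150, 170, 210, 230, 220, 180, 150, 140, 200,
     125, 118, 135, 160, 175, 220, 240, 225, 190, 155, 150, 210]
)

forecast = seasonal_naive_forecast(monthly_sales, season_length=12, horizon=3)
print(forecast)
```

A model that cannot beat this baseline on held-out months is probably not capturing much beyond seasonality.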

In [122]:
#!pip install Faker

In [123]:
import pandas as pd
import numpy as np
from faker import Faker
import json
import random

fake = Faker()

# Function to generate random JSON for complex data columns
def generate_random_json():
    return json.dumps({'key': fake.word(), 'value': fake.random_number()})

# Sample data generation
data = {
    'ProductID': [fake.bothify(text='???-###') for _ in range(100)],
    'ProductName': [fake.word().capitalize() for _ in range(100)],
    'Category': [random.choice(['Refrigerator', 'Microwave', 'Washer', 'Dryer', 'Oven']) for _ in range(100)],
    'QuantitySold': np.random.randint(1, 50, size=100),
    'SalesDate': [fake.date_between(start_date='-2y', end_date='today') for _ in range(100)],
    'UnitPrice': np.random.uniform(100, 2000, size=100).round(2),
    'Revenue': np.random.uniform(200, 10000, size=100).round(2),
    'Channel': [random.choice(['Online', 'In-store', 'Distributor']) for _ in range(100)],
    'Weekday': [fake.day_of_week() for _ in range(100)],
    'Month': np.random.randint(1, 13, size=100),
    'Quarter': np.random.randint(1, 5, size=100),
    'Year': np.random.randint(2019, 2023, size=100),
    'Holiday': np.random.choice([True, False], size=100),
    'EconomicIndicators': [generate_random_json() for _ in range(100)],
    'MarketTrends': [generate_random_json() for _ in range(100)],
    'CompetitorPricing': np.random.uniform(100, 2000, size=100).round(2),
    'Promotions': np.random.choice([True, False], size=100),
    'WeatherConditions': [random.choice(['Sunny', 'Rainy', 'Snowy', 'Cloudy']) for _ in range(100)],
    'PoliticalEvents': np.random.choice([True, False], size=100),
    'StockLevels': np.random.randint(0, 100, size=100),
    'LeadTime': np.random.randint(1, 30, size=100),
    'SupplierPerformance': np.random.uniform(0.5, 1.0, size=100).round(2),
    'CustomerSegment': [random.choice(['Youth', 'Adult', 'Senior']) for _ in range(100)],
    'PurchaseHistory': [generate_random_json() for _ in range(100)],
    'ProductRatingsReviews': np.random.uniform(1, 5, size=100).round(2),
    'Features': [generate_random_json() for _ in range(100)],
    'LaunchDate': [fake.date_between(start_date='-5y', end_date='today') for _ in range(100)],
    'LifeCycleStage': [random.choice(['Introduction', 'Growth', 'Maturity', 'Decline']) for _ in range(100)],
}

# Creating DataFrame
df = pd.DataFrame(data)
df.head()

Unnamed: 0,ProductID,ProductName,Category,QuantitySold,SalesDate,UnitPrice,Revenue,Channel,Weekday,Month,...,PoliticalEvents,StockLevels,LeadTime,SupplierPerformance,CustomerSegment,PurchaseHistory,ProductRatingsReviews,Features,LaunchDate,LifeCycleStage
0,AXX-284,Population,Dryer,2,2023-04-08,1767.49,8297.32,Distributor,Tuesday,5,...,True,13,28,0.76,Senior,"{""key"": ""cause"", ""value"": 23638724}",2.27,"{""key"": ""my"", ""value"": 567049}",2023-08-19,Decline
1,FHb-183,Care,Washer,6,2022-08-23,695.98,4474.01,Distributor,Monday,3,...,True,19,29,0.6,Youth,"{""key"": ""weight"", ""value"": 7}",4.67,"{""key"": ""see"", ""value"": 88703295}",2021-12-06,Introduction
2,QBK-760,Job,Microwave,10,2022-12-17,1200.98,2609.65,In-store,Saturday,9,...,False,75,16,0.87,Senior,"{""key"": ""between"", ""value"": 35}",3.7,"{""key"": ""sure"", ""value"": 738250840}",2023-11-09,Introduction
3,Aqe-765,Back,Refrigerator,29,2023-04-15,1893.33,9228.89,In-store,Tuesday,7,...,False,92,26,0.72,Adult,"{""key"": ""team"", ""value"": 8740}",1.17,"{""key"": ""herself"", ""value"": 88470213}",2021-09-27,Growth
4,NUN-031,Natural,Washer,41,2023-06-05,1434.96,6755.29,Online,Wednesday,5,...,False,73,21,0.96,Senior,"{""key"": ""ever"", ""value"": 72771}",3.41,"{""key"": ""turn"", ""value"": 89}",2019-11-18,Maturity
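In the generated table, Revenue, Month, Quarter and Year are sampled independently of QuantitySold, UnitPrice and SalesDate, so row 0, for example, has a sale date of 2023-04-08 but Month 5. If internally consistent synthetic data is wanted, those columns can be derived rather than sampled. A minimal sketch, assuming a small stand-in frame called `sales` (a hypothetical name) rather than the full 100-row table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5

# Small stand-in for the generated table (hypothetical values)
sales = pd.DataFrame({
    "QuantitySold": rng.integers(1, 50, size=n),
    "UnitPrice": rng.uniform(100, 2000, size=n).round(2),
    "SalesDate": pd.to_datetime(
        ["2023-04-08", "2022-08-23", "2022-12-17", "2023-04-15", "2023-06-05"]
    ),
})

# Derive the dependent columns instead of sampling them independently
sales["Revenue"] = (sales["QuantitySold"] * sales["UnitPrice"]).round(2)
sales["Weekday"] = sales["SalesDate"].dt.day_name()
sales["Month"] = sales["SalesDate"].dt.month
sales["Quarter"] = sales["SalesDate"].dt.quarter
sales["Year"] = sales["SalesDate"].dt.year

print(sales)
```

With this change, any relationship a model finds between price, quantity and revenue, or between date parts, reflects the construction of the data rather than noise.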


In [124]:
# Function to randomly insert NaN values into each column
def insert_random_nans(df, fraction=0.1):
    assert 0 < fraction < 1, "Fraction must be between 0 and 1"

    nrows, ncols = df.shape
    for col in df.columns:
        # Number of NaNs to insert
        n_nans = int(np.floor(nrows * fraction))
        # Randomly choose indices to replace with NaNs
        nan_indices = np.random.choice(nrows, n_nans, replace=False)
        df.iloc[nan_indices, df.columns.get_loc(col)] = np.nan

# Apply the function to your DataFrame
insert_random_nans(df, fraction=0.1)  # This will replace ~10% of values in each column with NaNs
df.head()


Unnamed: 0,ProductID,ProductName,Category,QuantitySold,SalesDate,UnitPrice,Revenue,Channel,Weekday,Month,...,PoliticalEvents,StockLevels,LeadTime,SupplierPerformance,CustomerSegment,PurchaseHistory,ProductRatingsReviews,Features,LaunchDate,LifeCycleStage
0,,Population,Dryer,2.0,,1767.49,8297.32,Distributor,Tuesday,5.0,...,,13.0,28.0,0.76,Senior,"{""key"": ""cause"", ""value"": 23638724}",2.27,"{""key"": ""my"", ""value"": 567049}",2023-08-19,Decline
1,FHb-183,Care,Washer,6.0,2022-08-23,695.98,4474.01,Distributor,Monday,3.0,...,,19.0,29.0,,Youth,"{""key"": ""weight"", ""value"": 7}",4.67,,2021-12-06,Introduction
2,,Job,Microwave,10.0,2022-12-17,1200.98,2609.65,In-store,Saturday,9.0,...,False,75.0,16.0,0.87,,"{""key"": ""between"", ""value"": 35}",3.7,"{""key"": ""sure"", ""value"": 738250840}",2023-11-09,
3,Aqe-765,Back,Refrigerator,29.0,2023-04-15,1893.33,9228.89,In-store,Tuesday,7.0,...,False,92.0,26.0,0.72,Adult,"{""key"": ""team"", ""value"": 8740}",1.17,,2021-09-27,Growth
4,NUN-031,Natural,Washer,41.0,2023-06-05,1434.96,6755.29,Online,,5.0,...,,73.0,21.0,0.96,Senior,"{""key"": ""ever"", ""value"": 72771}",3.41,"{""key"": ""turn"", ""value"": 89}",2019-11-18,Maturity
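The function above overwrites exactly floor(nrows × fraction) cells in every column and modifies `df` in place. A possible variant, sketched below, returns a copy, uses a seeded generator so the result is reproducible, and uses `Series.mask` so that integer and boolean columns are upcast by pandas instead of having NaN assigned into them directly. The name `with_random_nans` and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def with_random_nans(frame, fraction=0.1, seed=0):
    """Return a copy of `frame` with the same number of NaNs placed at random in every column."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = np.random.default_rng(seed)
    n_rows = len(frame)
    n_nans = int(np.floor(n_rows * fraction))
    out = frame.copy()
    for col in out.columns:
        # Boolean flags for the rows to blank out in this column
        hide = np.zeros(n_rows, dtype=bool)
        hide[rng.choice(n_rows, size=n_nans, replace=False)] = True
        # Series.mask replaces the flagged cells with NaN and upcasts the dtype if needed
        out[col] = out[col].mask(hide)
    return out

demo = pd.DataFrame({"a": range(20), "b": list("abcdefghijklmnopqrst")})
demo_missing = with_random_nans(demo, fraction=0.1)
print(demo_missing.isna().sum())
```

Keeping the original frame untouched also makes it possible to compare imputed values against the true ones later.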


inspect dataset df

extract json values into new columns

In [127]:
# Function to check if a cell contains a valid JSON object with 'key' and 'value'
def is_valid_json(cell):
    try:
        json_obj = json.loads(cell)
        return 'key' in json_obj and 'value' in json_obj
    except (TypeError, ValueError):  # non-string cell (e.g. NaN) or malformed JSON
        return False

# Loop through each column in the DataFrame
for col in df.columns:
    # Test the first non-null cell for the JSON pattern, so a NaN in row 0 does not hide a JSON column.
    # Assumes that if one cell is JSON, the rest of the column is too. Adjust logic as needed for your use case.
    non_null = df[col].dropna()
    if not non_null.empty and is_valid_json(non_null.iloc[0]):
        # Parse the column
        df[f'{col}_Key'] = df[col].apply(lambda x: json.loads(x)['key'] if is_valid_json(x) else None)
        df[f'{col}_Value'] = df[col].apply(lambda x: json.loads(x)['value'] if is_valid_json(x) else None)
        df = df.drop(columns=col)

df.head()

Unnamed: 0,ProductID,ProductName,Category,QuantitySold,SalesDate,UnitPrice,Revenue,Channel,Weekday,Month,...,LaunchDate,LifeCycleStage,EconomicIndicators_Key,EconomicIndicators_Value,MarketTrends_Key,MarketTrends_Value,PurchaseHistory_Key,PurchaseHistory_Value,Features_Key,Features_Value
0,,Population,Dryer,2.0,,1767.49,8297.32,Distributor,Tuesday,5.0,...,2023-08-19,Decline,election,4404.0,toward,431973732.0,cause,23638724.0,my,567049.0
1,FHb-183,Care,Washer,6.0,2022-08-23,695.98,4474.01,Distributor,Monday,3.0,...,2021-12-06,Introduction,where,9.0,,,weight,7.0,,
2,,Job,Microwave,10.0,2022-12-17,1200.98,2609.65,In-store,Saturday,9.0,...,2023-11-09,,lead,5077492.0,onto,95.0,between,35.0,sure,738250840.0
3,Aqe-765,Back,Refrigerator,29.0,2023-04-15,1893.33,9228.89,In-store,Tuesday,7.0,...,2021-09-27,Growth,start,7499883.0,nation,434.0,team,8740.0,,
4,NUN-031,Natural,Washer,41.0,2023-06-05,1434.96,6755.29,Online,,5.0,...,2019-11-18,Maturity,direction,36168.0,five,9845051.0,ever,72771.0,turn,89.0
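The extraction above parses each cell up to four times (once per validity check and once per field, for both new columns). A possible alternative is to parse each cell once and expand the resulting pair into two columns. This is a sketch on a three-row stand-in frame; the helper name `parse_key_value` is hypothetical.

```python
import json
import numpy as np
import pandas as pd

def parse_key_value(cell):
    """Return (key, value) from a JSON string, or (None, None) if the cell is not valid."""
    try:
        obj = json.loads(cell)
        return obj["key"], obj["value"]
    except (TypeError, ValueError, KeyError):
        return None, None

demo = pd.DataFrame({
    "ProductName": ["Population", "Care", "Job"],
    "Features": ['{"key": "my", "value": 567049}', np.nan, '{"key": "sure", "value": 738250840}'],
})

# Parse every cell once, then expand the (key, value) pairs into two columns
parsed = pd.DataFrame(
    demo["Features"].map(parse_key_value).tolist(),
    columns=["Features_Key", "Features_Value"],
    index=demo.index,
)
demo = pd.concat([demo.drop(columns="Features"), parsed], axis=1)
print(demo)
```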


In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ProductID              90 non-null     object 
 1   ProductName            90 non-null     object 
 2   Category               90 non-null     object 
 3   QuantitySold           90 non-null     float64
 4   SalesDate              90 non-null     object 
 5   UnitPrice              90 non-null     float64
 6   Revenue                90 non-null     float64
 7   Channel                90 non-null     object 
 8   Weekday                90 non-null     object 
 9   Month                  90 non-null     float64
 10  Quarter                90 non-null     float64
 11  Year                   90 non-null     float64
 12  Holiday                90 non-null     object 
 13  EconomicIndicators     90 non-null     object 
 14  MarketTrends           90 non-null     object 
 15  Competi

In [128]:
df_numerical = df.select_dtypes(include=[np.number])
df_numerical.head()

Unnamed: 0,QuantitySold,UnitPrice,Revenue,Month,Quarter,Year,CompetitorPricing,StockLevels,LeadTime,SupplierPerformance,ProductRatingsReviews,EconomicIndicators_Value,MarketTrends_Value,PurchaseHistory_Value,Features_Value
0,2.0,1767.49,8297.32,5.0,3.0,2021.0,193.36,13.0,28.0,0.76,2.27,4404.0,431973732.0,23638724.0,567049.0
1,6.0,695.98,4474.01,3.0,1.0,2021.0,449.0,19.0,29.0,,4.67,9.0,,7.0,
2,10.0,1200.98,2609.65,9.0,4.0,2021.0,1135.67,75.0,16.0,0.87,3.7,5077492.0,95.0,35.0,738250840.0
3,29.0,1893.33,9228.89,7.0,2.0,2022.0,192.18,92.0,26.0,0.72,1.17,7499883.0,434.0,8740.0,
4,41.0,1434.96,6755.29,5.0,4.0,2022.0,587.92,73.0,21.0,0.96,3.41,36168.0,9845051.0,72771.0,89.0


fill nan values with deep learning prediction

In [129]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.impute import SimpleImputer

def fill_missing_with_dl(df, column):
    # Define features and target
    features = df.columns.difference([column]).tolist()
    X = df[features]
    y = df[column]
    
    # Handle missing values in features
    imputer = SimpleImputer(strategy='median')
    X_imputed = imputer.fit_transform(X)
    
    # Split the data into training and prediction sets
    X_train = X_imputed[~y.isna()]
    y_train = y[~y.isna()].values
    X_predict = X_imputed[y.isna()]

    # Normalize the input features (the small epsilon avoids division by zero for constant columns)
    X_mean = X_train.mean(axis=0)
    X_std = X_train.std(axis=0) + 1e-6
    X_train = (X_train - X_mean) / X_std
    X_predict = (X_predict - X_mean) / X_std

    # Check if there's anything to predict
    if not X_predict.size:
        return df
    
    # Define the deep learning model
    model = Sequential([
        Dense(128, activation='relu', input_dim=X_train.shape[1]),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1)
    ])

    model.compile(optimizer='adam', loss='mean_squared_error')
    
    # Early stopping to prevent overfitting
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
    
    # Split training data for validation
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    
    # Train the model
    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=200, callbacks=[early_stopping], batch_size=32)
    
    # Predict the missing values
    predicted_values = model.predict(X_predict)
    
    # Fill in the missing values in the original DataFrame
    df.loc[df[column].isna(), column] = predicted_values.flatten()
    
    # Round every column to 1 decimal place after filling (note: this also rounds the observed values)
    df = df.round(1)
    
    return df

for column in df_numerical.columns.to_list():
    df_numerical = fill_missing_with_dl(df_numerical, column)

Epoch 1/200
...
Epoch 31/200
Epoch 1/200
...
Epoch 47/200
...
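Training one neural network per column is a heavy approach for a table of 100 rows. As a point of comparison, a much lighter technique is nearest-neighbour imputation with scikit-learn's `KNNImputer`. This is a different method from the one used above, shown here only as a sketch on a made-up frame; the column values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical numeric frame with a few missing cells
demo = pd.DataFrame({
    "UnitPrice": rng.uniform(100, 2000, size=30),
    "StockLevels": rng.integers(0, 100, size=30).astype(float),
    "LeadTime": rng.integers(1, 30, size=30).astype(float),
})
demo.loc[[2, 7, 11], "UnitPrice"] = np.nan
demo.loc[[4, 7], "LeadTime"] = np.nan

# Each missing cell is replaced by the mean of that column over the 5 nearest rows,
# with distances computed on the columns that both rows have observed
imputer = KNNImputer(n_neighbors=5)
demo_filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns, index=demo.index)

print(demo_filled.isna().sum())
```

Because the distances depend on the scale of each column, it is usually worth scaling the features before this kind of imputation.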

In [130]:
df_numerical.isna().sum()

QuantitySold                0
UnitPrice                   0
Revenue                     0
Month                       0
Quarter                     0
Year                        0
CompetitorPricing           0
StockLevels                 0
LeadTime                    0
SupplierPerformance         0
ProductRatingsReviews       0
EconomicIndicators_Value    0
MarketTrends_Value          0
PurchaseHistory_Value       0
Features_Value              0
dtype: int64

apply numeric values standard scaling

In [131]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Standardize every numerical column except the first one (QuantitySold), which is left on its original scale
df_numerical[df_numerical.columns[1:]] = scaler.fit_transform(df_numerical[df_numerical.columns[1:]])
df_numerical.head()

Unnamed: 0,QuantitySold,UnitPrice,Revenue,Month,Quarter,Year,CompetitorPricing,StockLevels,LeadTime,SupplierPerformance,ProductRatingsReviews,EconomicIndicators_Value,MarketTrends_Value,PurchaseHistory_Value,Features_Value
0,2.0,1.356348,1.156912,-0.38794,0.642066,0.253843,-1.364285,-1.342198,1.5008,0.116193,-0.681875,-0.400524,1.771885,-0.12384,-0.356334
1,6.0,-0.839786,-0.197489,-0.989865,-1.268845,0.253843,-0.889651,-1.120341,1.623585,1.33928,1.437943,-0.400545,-0.372641,-0.300062,-0.359465
2,10.0,0.195256,-0.857951,0.815909,1.597522,0.253843,0.38551,0.950325,0.027381,0.727737,0.554685,-0.375744,-0.372641,-0.300062,3.716853
3,29.0,1.614186,1.486931,0.213984,-0.313389,0.263317,-1.366513,1.57892,1.25523,-0.49535,-1.653457,-0.363912,-0.372639,-0.299997,-0.359465
4,41.0,0.67486,0.61066,-0.38794,1.597522,0.263317,-0.631722,0.876372,0.641306,1.33928,0.289708,-0.400368,-0.323765,-0.299519,-0.359464
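For reference, `StandardScaler` transforms each column to z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The short check below reproduces the result by hand on a single made-up column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [6.0], [10.0], [29.0], [41.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps columns with very large raw values from dominating distance-based or gradient-based models.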


In [132]:
df_categorical = df.select_dtypes(exclude=[np.number])
df_categorical.head()

Unnamed: 0,ProductID,ProductName,Category,SalesDate,Channel,Weekday,Holiday,Promotions,WeatherConditions,PoliticalEvents,CustomerSegment,LaunchDate,LifeCycleStage,EconomicIndicators_Key,MarketTrends_Key,PurchaseHistory_Key,Features_Key
0,,Population,Dryer,,Distributor,Tuesday,True,True,Rainy,,Senior,2023-08-19,Decline,election,toward,cause,my
1,FHb-183,Care,Washer,2022-08-23,Distributor,Monday,False,True,Cloudy,,Youth,2021-12-06,Introduction,where,,weight,
2,,Job,Microwave,2022-12-17,In-store,Saturday,False,True,Cloudy,False,,2023-11-09,,lead,onto,between,sure
3,Aqe-765,Back,Refrigerator,2023-04-15,In-store,Tuesday,False,,Cloudy,False,Adult,2021-09-27,Growth,start,nation,team,
4,NUN-031,Natural,Washer,2023-06-05,Online,,True,False,Rainy,,Senior,2019-11-18,Maturity,direction,five,ever,turn


In [133]:
# Find the columns whose name contains 'date' (SalesDate and LaunchDate)
date_columns = [col for col in df_categorical.columns if 'date' in col.lower()]

# Drop the date columns, since they are not categorical features.
# Converting them with pd.to_datetime first is unnecessary because they are removed immediately.
df_categorical = df_categorical.drop(columns=date_columns)
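The cell above discards SalesDate and LaunchDate entirely. If the date information should stay available for forecasting, one option is to turn the dates into numeric features before dropping the raw columns. The sketch below uses a small made-up frame; the feature names `SaleMonth`, `SaleDayOfWeek` and `ProductAgeDays` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "SalesDate": ["2023-04-08", None, "2022-12-17"],
    "LaunchDate": ["2021-08-19", "2021-12-06", "2020-11-09"],
})

for col in ["SalesDate", "LaunchDate"]:
    # errors="coerce" turns unparseable or missing entries into NaT instead of raising
    demo[col] = pd.to_datetime(demo[col], errors="coerce")

# Numeric features derived from the dates (NaN where a date is missing)
demo["SaleMonth"] = demo["SalesDate"].dt.month
demo["SaleDayOfWeek"] = demo["SalesDate"].dt.dayofweek  # Monday=0 ... Sunday=6
demo["ProductAgeDays"] = (demo["SalesDate"] - demo["LaunchDate"]).dt.days

print(demo)
```

These derived columns are numeric, so they could be appended to the numerical frame and imputed and scaled with the other numeric features.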

In [134]:
df_categorical.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   ProductID               90 non-null     object
 1   ProductName             90 non-null     object
 2   Category                90 non-null     object
 3   Channel                 90 non-null     object
 4   Weekday                 90 non-null     object
 5   Holiday                 90 non-null     object
 6   Promotions              90 non-null     object
 7   WeatherConditions       90 non-null     object
 8   PoliticalEvents         90 non-null     object
 9   CustomerSegment         90 non-null     object
 10  LifeCycleStage          90 non-null     object
 11  EconomicIndicators_Key  90 non-null     object
 12  MarketTrends_Key        90 non-null     object
 13  PurchaseHistory_Key     90 non-null     object
 14  Features_Key            90 non-null     object
dtypes: obje

fill nan values with deep learning prediction

In [135]:
df_categorical.isna().sum()

ProductID                 10
ProductName               10
Category                  10
Channel                   10
Weekday                   10
Holiday                   10
Promotions                10
WeatherConditions         10
PoliticalEvents           10
CustomerSegment           10
LifeCycleStage            10
EconomicIndicators_Key    10
MarketTrends_Key          10
PurchaseHistory_Key       10
Features_Key              10
dtype: int64

In [136]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

def fill_missing_with_dl_categorical(df, column):
    # Work on a label-encoded copy for modelling; predictions are written back into df at the end
    df_copy = df.copy()

    # Initialize LabelEncoder for all columns
    le_dict = {}
    for col in df_copy.columns:
        le = LabelEncoder()
        # Convert NaN to a placeholder string and then encode
        df_copy[col] = le.fit_transform(df_copy[col].fillna('Missing').astype(str))
        le_dict[col] = le
    
    # Define features and target
    features = [col for col in df_copy.columns if col != column]
    X = df_copy[features]
    y = df_copy[column]
    
    # Handle missing values in features
    imputer = SimpleImputer(strategy='median')
    X_imputed = imputer.fit_transform(X)

    # Normalize the input features
    X_mean = np.mean(X_imputed, axis=0)
    X_std = np.std(X_imputed, axis=0)
    X_normalized = (X_imputed - X_mean) / (X_std + 1e-6)

    # Identify rows with missing target values after encoding ('Missing' placeholder)
    if 'Missing' in le_dict[column].classes_:
        missing_value_encoded = le_dict[column].transform(['Missing'])[0]
        missing_indices = (y == missing_value_encoded)
    else:
        return df

    # Split the data into training and prediction sets
    X_train = X_normalized[~missing_indices]
    y_train = y[~missing_indices]
    X_predict = X_normalized[missing_indices]

    # Check if there's anything to predict
    if not X_predict.size:
        return df
    
    # Define the deep learning model
    model = Sequential([
        Dense(128, activation='relu', input_dim=X_train.shape[1]),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1)  # Output layer for regression-like approach
    ])

    model.compile(optimizer='adam', loss='mean_squared_error')
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
    
    # Train the model
    X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    model.fit(X_train_split, y_train_split, validation_data=(X_val_split, y_val_split), epochs=200, callbacks=[early_stopping], batch_size=32)
    
    # Predict the missing values
    predicted_values = model.predict(X_predict).flatten()
    
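    # Round predictions to the nearest class index and clip to the valid range,
    # so inverse_transform cannot fail on an out-of-range index.
    # Note: nothing here excludes the 'Missing' placeholder, so it can still be predicted,
    # and predictions are strings, so boolean columns end up with mixed bool/str values.
    n_classes = len(le_dict[column].classes_)
    predicted_idx = np.clip(np.round(predicted_values), 0, n_classes - 1).astype(int)
    predicted_categories = le_dict[column].inverse_transform(predicted_idx)
    df.loc[df[column].isna(), column] = predicted_categories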

    return df

for column in df_categorical.columns.to_list():
    df_categorical = fill_missing_with_dl_categorical(df_categorical, column)

Epoch 1/200
...
Epoch 37/200
Epoch 1/200
...
Epoch 41/200
...
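The function above regresses on label-encoded class indices, which treats unordered categories as if they were points on a number line. A possible alternative is to treat the fill as a classification problem. The sketch below swaps in a scikit-learn `RandomForestClassifier` on one-hot encoded predictors; it is not the notebook's method, and the function name `fill_missing_with_classifier` and the demo data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fill_missing_with_classifier(frame, column, seed=0):
    """Fill NaNs in `column` with class predictions made from the other columns."""
    missing = frame[column].isna()
    if not missing.any() or missing.all():
        return frame
    # One-hot encode the predictors; dummy_na=True keeps their own gaps as a separate category
    X = pd.get_dummies(frame.drop(columns=column).astype(object), dummy_na=True)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X[~missing], frame.loc[~missing, column].astype(str))
    out = frame.copy()
    # Predictions are always one of the classes seen in training, never a placeholder
    out.loc[missing, column] = clf.predict(X[missing])
    return out

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Category": rng.choice(["Washer", "Dryer", "Oven"], size=40),
    "Channel": rng.choice(["Online", "In-store"], size=40),
    "CustomerSegment": rng.choice(["Youth", "Adult", "Senior"], size=40),
})
demo.loc[[3, 10, 25], "Channel"] = np.nan

filled = fill_missing_with_classifier(demo, "Channel")
print(filled["Channel"].value_counts())
```

Since the classifier only ever outputs observed classes, no 'Missing' label can appear in the filled column. On purely random data such as this, the predictions carry no real signal, so the example only demonstrates the mechanics.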

In [137]:
df_categorical.isna().sum()

ProductID                 0
ProductName               0
Category                  0
Channel                   0
Weekday                   0
Holiday                   0
Promotions                0
WeatherConditions         0
PoliticalEvents           0
CustomerSegment           0
LifeCycleStage            0
EconomicIndicators_Key    0
MarketTrends_Key          0
PurchaseHistory_Key       0
Features_Key              0
dtype: int64

In [138]:
df_categorical.head()

Unnamed: 0,ProductID,ProductName,Category,Channel,Weekday,Holiday,Promotions,WeatherConditions,PoliticalEvents,CustomerSegment,LifeCycleStage,EconomicIndicators_Key,MarketTrends_Key,PurchaseHistory_Key,Features_Key
0,WVP-224,Population,Dryer,Distributor,Tuesday,True,True,Rainy,False,Senior,Decline,election,toward,cause,my
1,FHb-183,Care,Washer,Distributor,Monday,False,True,Cloudy,True,Youth,Introduction,where,of,weight,medical
2,frB-431,Job,Microwave,In-store,Saturday,False,True,Cloudy,False,Missing,Growth,lead,onto,between,sure
3,Aqe-765,Back,Refrigerator,In-store,Tuesday,False,Missing,Cloudy,False,Adult,Growth,start,nation,team,once
4,NUN-031,Natural,Washer,Online,Saturday,True,False,Rainy,Missing,Senior,Maturity,direction,five,ever,turn


replace values that occur only once with an 'other' label

In [139]:
for column in df_categorical.columns:
    value_counts = df_categorical[column].value_counts()
    unique_values = value_counts[value_counts == 1].index.tolist()
    df_categorical[column] = df_categorical[column].apply(lambda x: 'other' if x in unique_values else x)
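The same grouping can be written without a Python-level lambda, and with the rarity threshold exposed as a parameter. This is a sketch; the helper name `group_rare_values` and the `min_count` parameter are hypothetical, and `min_count=2` corresponds to the rule used above (values seen exactly once become 'other').

```python
import pandas as pd

def group_rare_values(series, min_count=2, label="other"):
    """Replace values seen fewer than `min_count` times with `label`."""
    # Map every cell to the number of times its value occurs in the column
    counts = series.map(series.value_counts())
    # NaN cells have no count and are left untouched
    return series.where(counts.isna() | (counts >= min_count), label)

channel = pd.Series(["Online", "In-store", "Online", "Kiosk", "In-store", None])
grouped = group_rare_values(channel)
print(grouped.tolist())
```

Raising `min_count` is one way to keep the number of one-hot columns small when a column has many infrequent values.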

In [140]:
for column in df_categorical.columns:
    print(df_categorical[column].value_counts())

ProductID
other      80
WVP-224     2
frB-431     2
XWL-477     2
eXu-255     2
WYQ-059     2
XfO-687     2
bmP-687     2
fHD-956     2
axh-757     2
dFy-336     2
Name: count, dtype: int64
ProductName
other       75
Indicate     3
Job          2
Our          2
General      2
Impact       2
Instead      2
What         2
None         2
Keep         2
Know         2
Game         2
Window       2
Name: count, dtype: int64
Category
Washer          24
Oven            23
Dryer           22
Refrigerator    17
Microwave       11
Missing          3
Name: count, dtype: int64
Channel
Distributor    40
In-store       40
Online         19
other           1
Name: count, dtype: int64
Weekday
Sunday       20
Saturday     16
Thursday     16
Tuesday      14
Wednesday    12
Monday       11
Friday       11
Name: count, dtype: int64
Holiday
True       47
False      43
Missing     9
other       1
Name: count, dtype: int64
Promotions
True       47
False      43
Missing    10
Name: count, dtype: int64
Weather

apply one hot encoding for categorical

In [141]:
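# One-hot encode Category through LifeCycleStage; iloc[:, 2:-4] skips ProductID, ProductName and the four *_Key columns
df_categorical_encoded = pd.get_dummies(df_categorical.iloc[:, 2:-4], drop_first=True).astype('int')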
del df_categorical
df_categorical_encoded.head()

Unnamed: 0,Category_Microwave,Category_Missing,Category_Oven,Category_Refrigerator,Category_Washer,Channel_In-store,Channel_Online,Channel_other,Weekday_Monday,Weekday_Saturday,...,PoliticalEvents_True,PoliticalEvents_False,PoliticalEvents_Missing,PoliticalEvents_other,CustomerSegment_Missing,CustomerSegment_Senior,CustomerSegment_Youth,LifeCycleStage_Growth,LifeCycleStage_Introduction,LifeCycleStage_Maturity
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
1,0,0,0,0,1,0,0,0,1,0,...,0,0,0,1,0,0,1,0,1,0
2,1,0,0,0,0,1,0,0,0,1,...,0,0,0,0,1,0,0,1,0,0
3,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,1,0,1,0,0,1,...,0,0,1,0,0,1,0,0,0,1
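The `drop_first=True` argument explains why the output has no `Category_Dryer` or `Channel_Distributor` column: the first level of each column (in sorted order) is dropped and is represented by all zeros in the remaining indicator columns. A small illustration on a made-up column:

```python
import pandas as pd

demo = pd.DataFrame({"Channel": ["Online", "In-store", "Distributor", "Online"]})

# One indicator column per level
full = pd.get_dummies(demo, dtype=int)
# The first level (Distributor) is dropped and becomes the all-zeros baseline
reduced = pd.get_dummies(demo, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which matters mostly for linear models.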


concat numerical and categorical columns

In [142]:
# Join the scaled numerical features and the one-hot encoded categorical features column-wise
concatenated_df = pd.concat([df_numerical, df_categorical_encoded], axis=1)
del df_numerical, df_categorical_encoded
concatenated_df.head()

Unnamed: 0,QuantitySold,UnitPrice,Revenue,Month,Quarter,Year,CompetitorPricing,StockLevels,LeadTime,SupplierPerformance,...,PoliticalEvents_True,PoliticalEvents_False,PoliticalEvents_Missing,PoliticalEvents_other,CustomerSegment_Missing,CustomerSegment_Senior,CustomerSegment_Youth,LifeCycleStage_Growth,LifeCycleStage_Introduction,LifeCycleStage_Maturity
0,2.0,1.356348,1.156912,-0.38794,0.642066,0.253843,-1.364285,-1.342198,1.5008,0.116193,...,0,1,0,0,0,1,0,0,0,0
1,6.0,-0.839786,-0.197489,-0.989865,-1.268845,0.253843,-0.889651,-1.120341,1.623585,1.33928,...,0,0,0,1,0,0,1,0,1,0
2,10.0,0.195256,-0.857951,0.815909,1.597522,0.253843,0.38551,0.950325,0.027381,0.727737,...,0,0,0,0,1,0,0,1,0,0
3,29.0,1.614186,1.486931,0.213984,-0.313389,0.263317,-1.366513,1.57892,1.25523,-0.49535,...,0,0,0,0,0,0,0,1,0,0
4,41.0,0.67486,0.61066,-0.38794,1.597522,0.263317,-0.631722,0.876372,0.641306,1.33928,...,0,0,1,0,0,1,0,0,0,1
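One detail worth keeping in mind: `pd.concat(..., axis=1)` matches rows on index labels, not on position. Both frames here share the same 0 to 99 index, so the join is safe, but if one of them had been filtered or re-indexed the result would silently gain NaN rows. A small sketch with made-up frames shows the check and the failure mode:

```python
import pandas as pd

numeric = pd.DataFrame({"UnitPrice": [1.2, -0.4, 0.3]}, index=[0, 1, 2])
encoded = pd.DataFrame({"Channel_Online": [1, 0, 1]}, index=[0, 1, 2])

# Guard against silent misalignment before joining
assert numeric.index.equals(encoded.index), "indexes differ; reset or realign before concat"
combined = pd.concat([numeric, encoded], axis=1)

# What happens when the indexes differ: unmatched labels produce NaN cells
shifted = encoded.set_axis([1, 2, 3], axis=0)
misaligned = pd.concat([numeric, shifted], axis=1)
print(misaligned)
```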


ML