## Data Preprocessing
Data preprocessing is a crucial step in preparing datasets for machine learning. The goal is to ensure that the data is clean, transformed into an appropriate format, and ready to be fed into a machine learning model for training and prediction. This step often involves handling missing data, encoding categorical variables, creating new features, and selecting the most relevant features.

### Key Steps in Preprocessing
- Load Both Train and Test Data:

    - Read the datasets into memory from the specified file paths for both training and testing.

- Clean the Data:

    - Handle Missing Values:

        - Drop columns with more than 50% missing data.

        - Impute missing values in numerical columns using the median, and in categorical columns using the most frequent value.

- Encode Categorical Features:

    - Convert categorical variables into numeric format for compatibility with machine learning algorithms.

        - Label Encoding for ordinal categories.

        - One-Hot Encoding for nominal categories.

- Create New Features (Feature Engineering):

    - Extract relevant features from datetime columns, such as DayOfWeek, IsWeekend, MonthPosition, and Quarter.

    - Create new features such as:

        - Days to the next holiday.

        - Days after the last holiday.

        - Time since the last promotion.

        - Competition duration (years since the nearest competitor opened).

- Scale the Numeric Features:

    - Standardize the numeric features with a standard scaler (zero mean, unit variance).

- Identify the Most Relevant Features:

    - Select features that provide the most significant insights into the problem at hand.
    
    - Drop unnecessary features that do not contribute to the model or that have a high proportion of missing values.

- Save the Preprocessed Data:

    - Once the data is clean, scaled, and ready for training, save the processed datasets to new CSV files for future use in model training.
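
The cleaning and encoding steps listed above can be sketched with plain pandas. This is a minimal illustration and not the `DataPreprocessor` implementation: the helper names `clean_missing` and `encode_categoricals`, the toy columns, and the ordinal mapping for `Assortment` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_missing(df, max_missing=0.5):
    """Drop sparse columns, then impute the rest (hypothetical helper)."""
    # Keep only columns whose share of missing values is at most max_missing.
    df = df.loc[:, df.isna().mean() <= max_missing].copy()
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                # Numerical column: fill with the median.
                df[col] = df[col].fillna(df[col].median())
            else:
                # Categorical column: fill with the most frequent value.
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


def encode_categoricals(df, ordinal_maps, nominal_cols):
    """Map ordinal categories to integers and one-hot encode nominal ones."""
    df = df.copy()
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)
    return pd.get_dummies(df, columns=nominal_cols, dtype=int)


# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 620.0, 14130.0],
    "Assortment": ["a", "c", None, "a"],
    "StoreType": ["c", "a", "a", "d"],
    "MostlyEmpty": [np.nan, np.nan, np.nan, 1.0],
})

cleaned = clean_missing(toy)
encoded = encode_categoricals(
    cleaned,
    ordinal_maps={"Assortment": {"a": 0, "b": 1, "c": 2}},  # assumed ordering
    nominal_cols=["StoreType"],
)
print(encoded)
```

Here `MostlyEmpty` is dropped because 75% of its values are missing, the gap in `CompetitionDistance` is filled with the median, and the gap in `Assortment` is filled with the most frequent category before encoding.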

In [1]:
# Import necessary libraries
import pandas as pd
import logging
import os, sys
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import data preprocessor class
from data_preprocessing import DataPreprocessor

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2024-09-22 16:55:28,999 - INFO - Imported libraries and configured logging.


**Preprocessing ...**

In [2]:
logger.info("Preprocessed both the test and train datasets")
# Load and preprocess the datasets
if __name__ == "__main__":
    train_path = '../data/train_cleaned.csv'  # Path to train dataset
    test_path = '../data/test_cleaned.csv'  # Path to test dataset
    test_id = '../data/test.csv'  # Path to the original test file (passed as test_id)
    # Create an instance of the preprocessor
    preprocessor = DataPreprocessor(train_path, test_path, test_id)
    # Run the preprocessing pipeline on both datasets
    train_df, test_df = preprocessor.preprocess()
    # Save Preprocessed data
    preprocessor.save_data()


2024-09-22 16:55:29,025 - INFO - Preprocessed both the test and train datasets


Cleaning data...
Extracting datetime features...
Performing feature engineering...
Encoding categorical data...
Scaling numeric features...
Preprocessing complete.
Processed data saved to ../data/train_processed.csv and ../data/test_processed.csv.
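
The "Extracting datetime features" and "Performing feature engineering" stages logged above are implemented inside `DataPreprocessor`, which is not shown in this notebook. The sketch below is one plausible way to derive similar columns from a `Date` column. The helper name `add_datetime_features`, the month cut-offs (days 1-10, 11-20, 21 onward), and the holiday list are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def add_datetime_features(df, holidays):
    """Add calendar and holiday-distance columns (illustrative helper).

    df must contain a 'Date' column; holidays is any list-like of dates.
    """
    out = df.copy()
    dates = pd.to_datetime(out["Date"])
    day = dates.dt.day

    out["Weekday"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["IsWeekend"] = (out["Weekday"] >= 5).astype(int)
    out["Month"] = dates.dt.month
    # Assumed cut-offs for the position within the month.
    out["IsBeginningOfMonth"] = (day <= 10).astype(int)
    out["IsMidMonth"] = ((day > 10) & (day <= 20)).astype(int)
    out["IsEndOfMonth"] = (day > 20).astype(int)

    # Work at day resolution so differences come out as whole days.
    d = dates.to_numpy().astype("datetime64[D]")
    hol = np.sort(pd.to_datetime(holidays).to_numpy().astype("datetime64[D]"))

    next_idx = np.searchsorted(hol, d, side="left")       # first holiday >= date
    prev_idx = np.searchsorted(hol, d, side="right") - 1  # last holiday <= date
    has_next = next_idx < len(hol)
    has_prev = prev_idx >= 0

    # Rows with no following (or preceding) holiday in the list stay NaN.
    days_to = np.full(len(d), np.nan)
    days_after = np.full(len(d), np.nan)
    days_to[has_next] = (hol[next_idx[has_next]] - d[has_next]).astype(int)
    days_after[has_prev] = (d[has_prev] - hol[prev_idx[has_prev]]).astype(int)
    out["DaysToHoliday"] = days_to
    out["DaysAfterHoliday"] = days_after
    return out


toy = pd.DataFrame({"Date": ["2015-07-31", "2015-08-01", "2015-12-26"]})
features = add_datetime_features(toy, ["2015-01-01", "2015-08-15", "2015-12-25"])
print(features)
```

With pandas' convention of Monday=0, 2015-07-31 (a Friday) gets `Weekday` 4. Dates with no later holiday in the supplied list get `NaN` for `DaysToHoliday`, which would need its own imputation rule in a real pipeline.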


In [3]:
train_df.columns

Index(['Store', 'DayOfWeek', 'Sales', 'Open', 'Promo', 'StateHoliday',
       'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance',
       'Promo2', 'Weekday', 'IsWeekend', 'Month', 'DaysToHoliday',
       'DaysAfterHoliday', 'IsBeginningOfMonth', 'IsMidMonth', 'IsEndOfMonth',
       'IsHoliday', 'Promo_duration'],
      dtype='object')

In [4]:
test_df.columns

Index(['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday',
       'StoreType', 'Assortment', 'CompetitionDistance', 'Promo2', 'Weekday',
       'IsWeekend', 'Month', 'DaysToHoliday', 'DaysAfterHoliday',
       'IsBeginningOfMonth', 'IsMidMonth', 'IsEndOfMonth', 'IsHoliday',
       'Promo_duration'],
      dtype='object')
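
Before training, it can help to confirm that the two frames expose the same features in the same order, with the target as the only difference. The helper below is a small sketch of such a check. The name `column_mismatch` is hypothetical, and the two lists are copied from the outputs above.

```python
def column_mismatch(train_cols, test_cols, target="Sales"):
    """Compare feature columns of train and test, ignoring the target."""
    train_features = [c for c in train_cols if c != target]
    missing_in_test = [c for c in train_features if c not in test_cols]
    extra_in_test = [c for c in test_cols if c not in train_features]
    same_order = train_features == list(test_cols)
    return missing_in_test, extra_in_test, same_order


train_cols = ['Store', 'DayOfWeek', 'Sales', 'Open', 'Promo', 'StateHoliday',
              'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance',
              'Promo2', 'Weekday', 'IsWeekend', 'Month', 'DaysToHoliday',
              'DaysAfterHoliday', 'IsBeginningOfMonth', 'IsMidMonth',
              'IsEndOfMonth', 'IsHoliday', 'Promo_duration']
test_cols = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday',
             'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance',
             'Promo2', 'Weekday', 'IsWeekend', 'Month', 'DaysToHoliday',
             'DaysAfterHoliday', 'IsBeginningOfMonth', 'IsMidMonth',
             'IsEndOfMonth', 'IsHoliday', 'Promo_duration']

missing, extra, same_order = column_mismatch(train_cols, test_cols)
print(missing, extra, same_order)  # [] [] True
```

In the notebook itself the same check could be run as `column_mismatch(train_df.columns, test_df.columns)`.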

In [5]:
train_df.head()

| Date | Store | DayOfWeek | Sales | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | Promo2 | Weekday | IsWeekend | Month | DaysToHoliday | DaysAfterHoliday | IsBeginningOfMonth | IsMidMonth | IsEndOfMonth | IsHoliday | Promo_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-07-31 | -1.732571 | 0.858414 | 5263.0 | 0.0 | 1.113726 | -0.029796 | 2.041038 | 0.582814 | -0.942988 | -0.649869 | -0.997372 | 4 | -0.460344 | 7 | -1.733955 | 1.733955 | -0.533778 | -0.937685 | 1.49057 | 2.036435 | -1.730687 |
| 2015-07-31 | -1.729462 | 0.858414 | 6064.0 | 0.0 | 1.113726 | -0.029796 | 2.041038 | -0.884146 | -0.942988 | -0.784392 | 1.002635 | 4 | -0.460344 | 7 | -1.733955 | 1.733955 | -0.533778 | -0.937685 | 1.49057 | 2.036435 | -1.730687 |
| 2015-07-31 | -1.726354 | 0.858414 | 8314.0 | 0.0 | 1.113726 | -0.029796 | 2.041038 | -0.884146 | -0.942988 | 1.821512 | 1.002635 | 4 | -0.460344 | 7 | -1.733955 | 1.733955 | -0.533778 | -0.937685 | 1.49057 | 2.036435 | -1.730687 |
| 2015-07-31 | -1.723246 | 0.858414 | 13995.0 | 0.0 | 1.113726 | -0.029796 | 2.041038 | 0.582814 | 1.070916 | -0.774783 | -0.997372 | 4 | -0.460344 | 7 | -1.733955 | 1.733955 | -0.533778 | -0.937685 | 1.49057 | 2.036435 | -1.730687 |
| 2015-07-31 | -1.720138 | 0.858414 | 4822.0 | 0.0 | 1.113726 | -0.029796 | 2.041038 | -0.884146 | -0.942988 | 2.206825 | -0.997372 | 4 | -0.460344 | 7 | -1.733955 | 1.733955 | -0.533778 | -0.937685 | 1.49057 | 2.036435 | -1.730687 |
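
Most columns in the output above are standardized, while `Sales` (the target) keeps its original units. The arithmetic behind standardization can be sketched in pandas as follows. It uses the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` computes. Whether `DataPreprocessor` fits its scaler on the training data only is not shown in this notebook, so fitting on train and reusing the statistics for test is an assumption here, as are the helper names and toy values.

```python
import pandas as pd


def fit_standardizer(train, cols):
    """Learn per-column mean and population std from the training data."""
    mean = train[cols].mean()
    # Replace a zero std with 1 so constant columns do not divide by zero.
    std = train[cols].std(ddof=0).replace(0, 1.0)
    return mean, std


def apply_standardizer(df, cols, mean, std):
    """Scale the given columns with previously learned statistics."""
    out = df.copy()
    out[cols] = (df[cols] - mean) / std
    return out


# Toy frames; the values are made up for illustration.
train = pd.DataFrame({"CompetitionDistance": [100.0, 200.0, 300.0],
                      "Sales": [5263.0, 6064.0, 8314.0]})
test = pd.DataFrame({"CompetitionDistance": [200.0, 400.0]})

feature_cols = ["CompetitionDistance"]  # the target is deliberately excluded
mean, std = fit_standardizer(train, feature_cols)
train_scaled = apply_standardizer(train, feature_cols, mean, std)
test_scaled = apply_standardizer(test, feature_cols, mean, std)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps test values on the same scale as the data the model was fitted on, so a test value equal to the training mean maps to 0.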


In [6]:
test_df.head()

| Id | Store | DayOfWeek | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | Promo2 | Weekday | IsWeekend | Month | DaysToHoliday | DaysAfterHoliday | IsBeginningOfMonth | IsMidMonth | IsEndOfMonth | IsHoliday | Promo_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -1.732571 | 0.278263 | 0.0 | 1.113726 | -0.029796 | -0.489947 | 0.582814 | -0.942988 | -0.649869 | -0.997372 | 3 | -0.460344 | 9 | -1.908824 | 1.908824 | -0.533778 | 1.066456 | -0.670884 | -0.491054 | -1.730687 |
| 2 | -1.726354 | 0.278263 | 0.0 | 1.113726 | -0.029796 | -0.489947 | -0.884146 | -0.942988 | 1.821512 | 1.002635 | 3 | -0.460344 | 9 | -1.908824 | 1.908824 | -0.533778 | 1.066456 | -0.670884 | -0.491054 | -1.730687 |
| 3 | -1.713922 | 0.278263 | 0.0 | 1.113726 | -0.029796 | -0.489947 | -0.884146 | 1.070916 | 1.990146 | -0.997372 | 3 | -0.460344 | 9 | -1.908824 | 1.908824 | -0.533778 | 1.066456 | -0.670884 | -0.491054 | -1.730687 |
| 4 | -1.710813 | 0.278263 | 0.0 | 1.113726 | -0.029796 | -0.489947 | -0.884146 | -0.942988 | 0.55123 | -0.997372 | 3 | -0.460344 | 9 | -1.908824 | 1.908824 | -0.533778 | 1.066456 | -0.670884 | -0.491054 | -1.730687 |
| 5 | -1.707705 | 0.278263 | 0.0 | 1.113726 | -0.029796 | -0.489947 | -0.884146 | 1.070916 | -0.503815 | -0.997372 | 3 | -0.460344 | 9 | -1.908824 | 1.908824 | -0.533778 | 1.066456 | -0.670884 | -0.491054 | -1.730687 |


In [7]:
test_df.shape, train_df.shape

((41088, 20), (844392, 21))