<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Data Preprocessing Functions</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            margin: 20px;
        }
        h1, h2 {
            color: #656;
        }
        code {
            background-color:rgb(59, 72, 89);
            padding: 2px 4px;
            border-radius: 4px;
        }
        pre {
            background-color:rgb(74, 48, 92);
            padding: 10px;
            border-radius: 4px;
        }
    </style>
</head>
<body>

<h1>Data Preprocessing Functions</h1>

<h2>1. Import Libraries</h2>
<p>This step imports the libraries used for data manipulation and numerical work.</p>
<pre><code>import pandas as pd
import numpy as np</code></pre>

<h2>2. Load the Data</h2>
<p>This function loads the training and test datasets from CSV files into pandas DataFrames.</p>
<pre><code>self.X_train = pd.read_csv('path_to_train_data.csv')
self.X_test = pd.read_csv('path_to_test_data.csv')</code></pre>

<h2>3. Handle Missing Values</h2>
<p>This function fills missing values in the specified columns with 0, so that later modeling steps do not encounter NaN values in those columns.</p>
<pre><code>columns_to_fill = ['Column1', 'Column2']  # Replace with actual column names
for col in columns_to_fill:
    self.X_train[col] = self.X_train[col].fillna(0)
    self.X_test[col] = self.X_test[col].fillna(0)</code></pre>

<h2>4. Encode Categorical Variables</h2>
<p>This function replaces categorical values in specific columns with numerical representations to prepare for model training.</p>
<pre><code>self.X_train['StateHoliday'] = self.X_train['StateHoliday'].replace({'0': 0, 'a': 1, 'b': 2, 'c': 3}).infer_objects(copy=False)
self.X_test['StateHoliday'] = self.X_test['StateHoliday'].replace({'0': 0, 'a': 1, 'b': 2, 'c': 3}).infer_objects(copy=False)</code></pre>

<h2>5. Drop Unnecessary Columns</h2>
<p>This function removes columns that are not needed for analysis, such as 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear'.</p>
<pre><code>columns_to_drop = ['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']
# errors='ignore' skips any listed column that is not present
self.X_train = self.X_train.drop(columns=columns_to_drop, errors='ignore')
self.X_test = self.X_test.drop(columns=columns_to_drop, errors='ignore')</code></pre>

<h2>6. Add 'sales' Column to Test Data</h2>
<p>This function adds a 'sales' column to the test DataFrame and sets every value to NaN, as a placeholder to be filled in later.</p>
<pre><code>self.X_test['sales'] = np.nan</code></pre>

<h2>7. Summary of Preprocessing</h2>
<p>This function displays the first rows of the processed training and test datasets so you can check that the preprocessing steps were applied.</p>
<pre><code>from IPython.display import display  # available by default in Jupyter; needed in plain scripts

print("Processed Training Data:")
display(self.X_train.head())

print("Processed Test Data:")
display(self.X_test.head())</code></pre>

</body>
</html>


In [2]:
import sys
import os

In [3]:
sys.path.append(os.path.abspath('..'))
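
The cell above makes the parent directory importable so that `scripts.task2preprocessing` can be found. As a minimal sketch of the same idea, the helper below (the name `add_to_sys_path` is hypothetical, not part of the project) resolves the directory with `pathlib` and appends it only if it is not already on `sys.path`, which avoids duplicate entries when the cell is re-run.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append the absolute form of `directory` to sys.path once and return it."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Assumes the notebook lives one level below the project root.
project_root = add_to_sys_path("..")
```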

In [4]:
from scripts.task2preprocessing import DataProcessor

We have imported everything we need. Next, we store the data file paths in variables and pass them to the `DataProcessor` class.

In [5]:
train_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\train.csv"
test_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\test.csv"
store_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\store.csv"
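
Hard-coded absolute paths only work on one machine. A possible alternative, sketched below, builds the three paths from a single base folder with `pathlib`. The folder name `rossmann-store-sales` is taken from the paths above; its relative location and the helper name `data_paths` are assumptions for illustration.

```python
from pathlib import Path


def data_paths(data_dir):
    """Return the train, test and store CSV paths under one base folder."""
    base = Path(data_dir)
    return {name: base / f"{name}.csv" for name in ("train", "test", "store")}


# Assumed layout: the data folder sits next to the notebook.
paths = data_paths("rossmann-store-sales")

# Report missing files early instead of failing later inside read_csv.
missing = [str(p) for p in paths.values() if not p.exists()]
if missing:
    print("Missing data files:", ", ".join(missing))
```

The resulting values could then be passed on as `DataProcessor(str(paths["train"]), str(paths["test"]), str(paths["store"]))`, assuming the class accepts plain path strings as it does above.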

In [6]:
D_pros = DataProcessor(train_path, test_path, store_path)

In [7]:
D_pros.load_data()

In [8]:
D_pros.preprocess_data()

  self.X_train['StateHoliday'] = self.X_train['StateHoliday'].replace({'0': 0, 'a': 1, 'b': 2, 'c': 3})
  self.X_test['StateHoliday'] = self.X_test['StateHoliday'].replace({'0': 0, 'a': 1, 'b': 2, 'c': 3})
  self.X_train['Assortment'] = self.X_train['Assortment'].replace({'a': 1, 'b': 2, 'c': 3})
  self.X_test['Assortment'] = self.X_test['Assortment'].replace({'a': 1, 'b': 2, 'c': 3})
  self.X_train['StoreType'] = self.X_train['StoreType'].replace({'a': 1, 'b': 2, 'c': 3, 'd': 4})
  self.X_test['StoreType'] = self.X_test['StoreType'].replace({'a': 1, 'b': 2, 'c': 3, 'd': 4})
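
The lines echoed under `preprocess_data()` appear to be the source lines that pandas prints together with a warning; the warning text itself is not shown here. Recent pandas versions emit a `FutureWarning` about silent downcasting when `replace` turns an object column into numbers. One way to sidestep that, sketched below on a toy frame, is to use `Series.map` with the same dictionaries. The helper name `encode_column` and the toy data are assumptions, not the project's code.

```python
import pandas as pd

# Mappings copied from the replace() calls echoed above.
STATE_HOLIDAY_CODES = {"0": 0, "a": 1, "b": 2, "c": 3}
ASSORTMENT_CODES = {"a": 1, "b": 2, "c": 3}
STORE_TYPE_CODES = {"a": 1, "b": 2, "c": 3, "d": 4}


def encode_column(series, codes):
    """Map category labels to integer codes; unknown labels become <NA>."""
    # astype(str) lets an integer 0 and the string '0' share one dictionary key.
    return series.astype(str).map(codes).astype("Int64")


# Toy frame for illustration only; not taken from the real CSV files.
toy = pd.DataFrame({
    "StateHoliday": ["0", 0, "a", "b", "c"],
    "Assortment": ["a", "b", "c", "a", "c"],
    "StoreType": ["a", "b", "c", "d", "a"],
})

toy["StateHoliday"] = encode_column(toy["StateHoliday"], STATE_HOLIDAY_CODES)
toy["Assortment"] = encode_column(toy["Assortment"], ASSORTMENT_CODES)
toy["StoreType"] = encode_column(toy["StoreType"], STORE_TYPE_CODES)
```

Unlike `replace`, `map` turns any label missing from the dictionary into a missing value, so unexpected categories show up as `<NA>` instead of passing through unchanged.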


In [9]:
D_pros.split_data()

In [10]:
D_pros.get_data()

(         Store  DayOfWeek        Date  Sales  Customers  Open  Promo  \
 0            1          5  2015-07-31   5263        555     1      1   
 1            2          5  2015-07-31   6064        625     1      1   
 2            3          5  2015-07-31   8314        821     1      1   
 3            4          5  2015-07-31  13995       1498     1      1   
 4            5          5  2015-07-31   4822        559     1      1   
 ...        ...        ...         ...    ...        ...   ...    ...   
 1017204   1111          2  2013-01-01      0          0     0      0   
 1017205   1112          2  2013-01-01      0          0     0      0   
 1017206   1113          2  2013-01-01      0          0     0      0   
 1017207   1114          2  2013-01-01      0          0     0      0   
 1017208   1115          2  2013-01-01      0          0     0      0   
 
         StateHoliday  SchoolHoliday StoreType Assortment  CompetitionDistance  \
 0                  0              1    