# 📊 DATA PREPROCESSING WORKBOOK
---

# 🧹 Data Pre-processing: The Foundation

### 📖 The Definition
Data pre-processing is the strategic process of **cleaning**, **organizing**, and **preparing** raw data. The objective is to transform raw information so that it becomes **accurate**, **consistent**, and **usable** for analysis or machine learning models.

### ⚙️ The Analogy
Think of it like **cleaning and arranging your workspace** before starting a critical project.

* If your desk is messy, files are missing, or notes are duplicated, your workflow will be chaotic and prone to errors.
* Similarly, if the input data is incomplete, duplicated, or inconsistent, the model's predictions will be unreliable.
* *In the business world, this is often summarized by the rule: **"Garbage In, Garbage Out."*** 🗑️➡️📉

![Data Cleaning Process](cleaning.jpeg)

In [None]:
# All imports
try:
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.model_selection import train_test_split
    print("✅ All libraries and modules imported successfully!")
except ImportError as err:
    # Report the missing package, then re-raise so later cells do not run with undefined names
    print(f"❌ Error importing libraries and modules: {err}")
    raise

## 📈 Load the data

In [None]:

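try:
    df = pd.read_csv("sales_data.csv")  # df stands for DataFrame
    print("✅ Successfully loaded data!")
except (FileNotFoundError, pd.errors.ParserError) as err:
    # Show the cause, then re-raise so later cells do not run without df
    print(f"❌ Error loading data: {err}")
    raise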

## ▶️ Preview

In [None]:
df.head()

# Some Basic Data Cleaning

# 1. 🔎 Handle missing values


In [None]:

# TODO: Fill missing values in the "Revenue" column with the mean
df["Revenue"] = df["Revenue"].fillna( ____ )
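
To see what mean imputation does before filling in the blank, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its `Units` column are hypothetical and are not part of `sales_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from sales_data.csv)
toy = pd.DataFrame({"Units": [100.0, np.nan, 300.0, 200.0]})

# Count the missing values before imputing
missing_before = toy["Units"].isna().sum()

# Series.mean() skips NaN by default, so only 100, 300 and 200 are averaged
units_mean = toy["Units"].mean()

# Replace each NaN with that mean
toy["Units"] = toy["Units"].fillna(units_mean)

print("Missing before:", missing_before)
print("Mean used for filling:", units_mean)
print(toy)
```

Mean imputation keeps the column average unchanged but shrinks its spread, so it is usually a reasonable default only when few values are missing.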

# 2. 🗑️ Remove duplicate entries


In [None]:

# Drop rows that exactly duplicate an earlier row (the first occurrence is kept)
df = df.drop_duplicates()
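
It can be useful to count the duplicates before dropping them. The sketch below does this on a small made-up DataFrame; the `toy` data is hypothetical, and `reset_index(drop=True)` is an optional extra step that the cell above does not perform.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({
    "Region": ["North", "South", "North", "East"],
    "Revenue": [100, 250, 100, 175],
})

# duplicated() marks every row that is identical to an earlier row
n_duplicates = toy.duplicated().sum()

# Drop the repeats and renumber the index so it has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows found:", n_duplicates)
print(deduped)
```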

# 3. 📠 Encode categorical column "Region"
This code performs a data transformation that prepares text data for a statistical model.

- The **Problem**: Most machine learning algorithms work on numbers; they can't do math on text labels like "North" or "South."
- The **Solution**: One-hot encoding converts the "Region" column into dummy variables, with one binary (0/1) column per region.
- The **Result**: Each row gets a 1 in the column for its own region and a 0 in all the others (a "North" row has 1 in the North column), allowing the model to mathematically weigh the impact of each region.

In [None]:
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a regular NumPy array
encoded = encoder.fit_transform(df[["Region"]]).toarray()

# 4. ➕ Add encoded columns to dataframe


In [None]:
df[encoder.get_feature_names_out()] = encoded
df = df.drop(columns=["Region"])   # Remove original column

# 5. 📝 Standardize the numerical features


In [None]:
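scaler = StandardScaler()  # rescales a column to zero mean and unit variance
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it into one column
df["Revenue_scaled"] = scaler.fit_transform(df[["Revenue"]]).ravel()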
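
`StandardScaler` replaces each value with its z-score, (value - mean) / standard deviation, using the population standard deviation. The sketch below checks this by hand on a made-up column; the `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not taken from sales_data.csv)
toy = pd.DataFrame({"Revenue": [100.0, 200.0, 300.0]})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Revenue"]]).ravel()

# The same calculation by hand; ddof=0 selects the population standard deviation
manual = (toy["Revenue"] - toy["Revenue"].mean()) / toy["Revenue"].std(ddof=0)

print("Scaled by StandardScaler:", scaled)
print("Scaled by hand:          ", manual.to_numpy())
print("Same result:", np.allclose(scaled, manual))
```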

# 6. 📦 Prepare features and target

In [None]:

X = df.drop(columns=["Monthly_Sales"])
y = df["Monthly_Sales"]

# TODO: Split into train and test (fill test_size)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= ____ )

print("✅ Preprocessing Complete!")
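
To see how `test_size` affects the split, here is a minimal sketch on made-up data. The `toy` values, the `test_size=0.25` fraction, and `random_state=42` are arbitrary choices for illustration, not recommendations for the exercise above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 8 rows (not taken from sales_data.csv)
toy = pd.DataFrame({
    "Revenue_scaled": [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5],
    "Monthly_Sales": [10, 12, 15, 18, 19, 22, 25, 30],
})

X = toy.drop(columns=["Monthly_Sales"])  # features
y = toy["Monthly_Sales"]                 # target

# test_size is the fraction of rows held out for testing;
# random_state makes the shuffle repeatable between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print("Training rows:", len(X_train))
print("Test rows:", len(X_test))
```

With 8 rows and a fraction of 0.25, 2 rows go to the test set and 6 stay in the training set. The features and the target are split on the same rows, so `X_train` and `y_train` stay aligned.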

## Final Preview

In [None]:
df.head()

# CONGRATULATIONS!! 🎉🎉✨


## You have successfully learned the basics of data cleaning!