### 1. Why is OOP "Good"?

Right now, your code is likely a list of instructions: "Do A, then B, then C." This works for small scripts, but as projects grow, it becomes "Spaghetti Code": messy and hard to untangle.

OOP changes the mental model. Instead of a list of instructions, you build **smart objects** that know how to handle themselves.

* **Organization (Encapsulation):** You group related data (variables) and behavior (functions) into a single container. The `DataGenerator` keeps its secrets (like the random seed) to itself. The rest of your code doesn't need to worry about it.
* **Reusability (Blueprints):** A Class is like a **Blueprint**. Once you write the code for a `Car`, you can create 100 distinct cars without rewriting code.
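* **Safety:** When a variable lives inside an object, other parts of your code are far less likely to change it by accident and break things. (Python does not strictly enforce privacy; a leading underscore, as in a name like `_seed`, is the usual convention for "internal, hands off.")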
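
To make the blueprint idea concrete, here is a minimal sketch. The `Car` class, its `color` and `mileage` attributes, and its `drive` method are illustrative names only; they are not part of your housing project.

```python
class Car:
    """A hypothetical blueprint: every car built from it gets its own data."""

    def __init__(self, color):
        self.color = color      # each car remembers its own color
        self.mileage = 0        # every car starts with zero miles

    def drive(self, miles):
        # Behavior lives next to the data it changes (encapsulation).
        self.mileage += miles


# One blueprint, many independent objects (reusability).
red_car = Car("red")
blue_car = Car("blue")

red_car.drive(120)

print(red_car.color, red_car.mileage)    # red 120
print(blue_car.color, blue_car.mileage)  # blue 0
```

Driving the red car does not touch the blue car's mileage, because each object carries its own state.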

---

### 2. Anatomy of a Class

Think of a Class as a **template**.

* **Attributes (Variables):** What the object *knows* (e.g., color, size, name).
* **Methods (Functions):** What the object *does* (e.g., drive, bark, calculate).

#### The "Self" Concept (The most confusing part!)

You will see `self` everywhere.

* Imagine you have a blueprint for a Human.
* `self.name` means: "The name of **this specific** human I am building right now."
* Without `self`, the code wouldn't know if you meant "Humanity's name" or "Ujwal's name."
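
A small sketch of the same idea in code. The `Human` class, the `introduce` method, and the second name `"Priya"` are made up for illustration.

```python
class Human:
    """Hypothetical example: `self` points at the one object being used."""

    def __init__(self, name):
        # `self.name` belongs to this specific human, not to the class.
        self.name = name

    def introduce(self):
        # Python passes the calling object in as `self` automatically.
        return f"Hi, I am {self.name}."


ujwal = Human("Ujwal")
priya = Human("Priya")

print(ujwal.introduce())  # Hi, I am Ujwal.
print(priya.introduce())  # Hi, I am Priya.

# Calling the method on an object is shorthand for this explicit form:
print(Human.introduce(ujwal))  # Hi, I am Ujwal.
```

The last line shows what `self` really is: just the object on the left of the dot, handed to the method as its first argument.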

---

### 3. A Sample Class (The "Coffee Machine")

Let's look at a simple example unrelated to your housing data so you can see the structure clearly.


In [1]:
class CoffeeMachine:
    """
    A blueprint for a smart coffee machine.
    """
    
    # 1. The Setup (__init__)
    # This runs automatically when you create a new object.
    # It sets the initial "State".
    def __init__(self, water_level, bean_type):
        self.water = water_level   # Attribute: The machine remembers its water level
        self.beans = bean_type     # Attribute: The machine remembers the bean type
        self.is_on = False         # Attribute: Default state is OFF

    # 2. A Method (Behavior)
    # This modifies the object's state
    def turn_on(self):
        self.is_on = True
        print("Machine is now ON.")

    # 3. Another Method (Logic)
    # This uses the object's state to do something
    def make_coffee(self):
        if not self.is_on:
            print("Error: Machine is OFF. Please turn it on.")
            return
        
        if self.water < 200:
            print("Error: Not enough water!")
        else:
            self.water = self.water - 200 # Update the state (consume water)
            print(f"Brewing a hot cup of {self.beans} coffee...")
            print(f"Water remaining: {self.water}ml")

# --- HOW TO USE IT ---

# 1. Create an Instance (Build the object from the blueprint)
# "my_machine" is now a real object.
my_machine = CoffeeMachine(water_level=500, bean_type="Arabica")

# 2. Interact with it
my_machine.make_coffee() # Fails because it's off
my_machine.turn_on()     # Changes state to ON
my_machine.make_coffee() # Success! Consumes water.
my_machine.make_coffee() # Success! Consumes water.
my_machine.make_coffee() # Fails! Not enough water left.

Error: Machine is OFF. Please turn it on.
Machine is now ON.
Brewing a hot cup of Arabica coffee...
Water remaining: 300ml
Brewing a hot cup of Arabica coffee...
Water remaining: 100ml
Error: Not enough water!


---

Coming back to Ensemble Methods

### 1. The Mental Model: "The Factory Line"

Think of your code as a factory floor. You need three distinct machines.

* **Machine A (Generator):** You give it raw settings; it spits out raw materials (Data).
* **Machine B (Preprocessor):** It takes raw materials, cleans them, and prepares them for assembly. It needs to remember how it cleaned the last batch so it can do the same for the next.
* **Machine C (Trainer):** It takes prepared materials and learns how to build the final product (Model).

### 2. How to Design Your Classes

In Python, a class generally needs two things:

1. **State (Attributes/`self`):** What does the machine need to *know* or *remember*? (e.g., the random seed, the saved model, the scaler settings).
2. **Behavior (Methods):** What does the machine *do*? (e.g., `generate()`, `clean()`, `train()`).

Here is how you should structure your three classes:

#### Class 1: The Data Generator

* **Goal:** Encapsulate all the fake data logic so it doesn't clutter your main code.
* **What it needs to know (`__init__`):**
    * How many samples do you want? (`n=1000`)
    * What is the random seed? (So it's reproducible).
* **What it does (Methods):**
    * `create_data()`: This method should contain that updated math/logic we discussed (linear relationship). It returns a DataFrame.



#### Class 2: The Preprocessor (CRITICAL)

* **Goal:** This is the trickiest one. It must **remember** the math it used on the training set (e.g., the mean value for filling missing data) so it can apply the *exact same math* to the test set.
* **What it needs to know (`__init__`):**
    * It needs a place to store the "pipeline" or "transformer" object once it's created. Initially, this is `None`.
* **What it does (Methods):**
    * `fit_transform(X_train)`: Learns the mean/mode from training data and transforms it. **Save the learner to `self`.**
    * `transform(X_test)`: Uses the *saved* learner to transform new data. **Do not learn new means here!**
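
The difference between the two calls is easiest to see on a toy example. This is a minimal sketch using scikit-learn's `SimpleImputer` on made-up numbers, not your housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up data: one numeric column with a missing value in each set.
X_train = np.array([[1000.0], [2000.0], [np.nan]])
X_test = np.array([[np.nan], [9000.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform: learn the training mean (1500) AND fill the training gaps.
X_train_filled = imputer.fit_transform(X_train)

# transform: reuse the stored training mean; the test set teaches it nothing.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)   # [1500.]
print(X_test_filled[0, 0])   # 1500.0, not 9000.0 (the test set's own mean)
```

The missing test value is filled with the training mean because the fitted imputer object was kept and reused, which is exactly what storing it on `self` achieves.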



#### Class 3: The Model Trainer

* **Goal:** Handle the messy parts of machine learning (GridSearch, fitting, predicting).
* **What it needs to know (`__init__`):**
    * Which algorithm are we using? (XGBoost).
    * The model object itself.
* **What it does (Methods):**
    * `train(X, y)`: Runs the GridSearch you learned earlier. It should find the best parameters and update `self.model` with the winner.
    * `evaluate(X, y)`: Uses `self.model` to predict and return the error score (RMSE).



---

### 3. What to Look Out For (The "Gotchas")

**1. The "State" Trap (Data Leakage)**

* *The Mistake:* Creating a preprocessor inside a function, using it, and letting it die.
* *The Fix:* Your `Preprocessor` class must survive between the training step and the testing step. If you re-initialize the class before testing, it "forgets" the training mean and calculates a new mean from the test data. **This is cheating (Data Leakage).**

**2. Hardcoding Values**

* *The Mistake:* Writing `price = size * 150` inside the generate method and leaving it there.
* *The Fix:* Try to pass these "magic numbers" as arguments or constants if you can, but for now, just keeping them inside the class is better than global variables.
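
One way to lift the magic numbers out, as a rough sketch. `PriceRule`, `base_price`, and `price_per_sqft` are hypothetical names; the defaults reuse the 50,000 base and 150 per square foot from the housing formula.

```python
import numpy as np


class PriceRule:
    """Hypothetical helper: pricing constants are arguments, not buried literals."""

    def __init__(self, base_price=50_000, price_per_sqft=150):
        self.base_price = base_price
        self.price_per_sqft = price_per_sqft

    def price(self, size_sqft):
        # Works on a single number or a NumPy array of sizes.
        return self.base_price + np.asarray(size_sqft) * self.price_per_sqft


default_rule = PriceRule()
pricey_rule = PriceRule(price_per_sqft=300)   # change the rule without editing the class

print(default_rule.price(1000))  # 200000
print(pricey_rule.price(1000))   # 350000
```

The numbers still have sensible defaults, but anyone reading the constructor can see what they are and override them.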

**3. The "God Object"**

* *The Mistake:* Making one giant class called `HousingSystem` that does everything.
* *The Fix:* Keep them separate. The `Generator` shouldn't know that XGBoost exists. The `Trainer` shouldn't care how the data was cleaned, only that it *is* cleaned.

### Your Task

Don't worry about the whole thing yet. **Start by writing just the first class: `HousingDataGenerator`.**

Draft it out:

1. Define the class.
2. Write the `__init__` to accept `n_samples`.
3. Write a method called `get_data()` that includes your *fixed* logic (with the price formula).

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class HousingDataGenerator():
    """
    Encapsulate all the fake data logic so it doesn't clutter your main code.
    """
    def __init__(self, n_samples = 1000, seed_val = 42):
        self.n_samples = n_samples
        self.seed_val = seed_val


    def get_data(self):
        np.random.seed(self.seed_val)
        n = self.n_samples

        # 1. Generate Features
        size = np.random.normal(1500, 500, n)
        year = np.random.randint(1950, 2024, n)
        neighborhood = np.random.choice(['Downtown', 'Suburb', 'Rural'], n)
        # Caution: NumPy coerces np.nan to the string 'nan' in this mixed list,
        # so 'nan' ends up as a fourth Style category, not a true missing value.
        style = np.random.choice(['Modern', 'Victorian', 'Ranch', np.nan], n)

        # 2. Define the "Rules" (The Relationship)
        # Start with a Base Price
        price = 50000 + (size * 150) + ((year - 1950) * 1000)

        # Add value for Neighborhoods
        # (np.where picks per row: Downtown adds 50k, Suburb adds 20k, Rural adds 0)
        price += np.where(neighborhood == 'Downtown', 50000, 0)
        price += np.where(neighborhood == 'Suburb', 20000, 0)

        # 3. Add some "Noise" (Randomness)
        # Real life isn't perfect math, so we add a little random fluctuation
        noise = np.random.normal(0, 20000, n)
        final_price = price + noise

        df = pd.DataFrame({
            'Size_sqft': size,
            'Year_Built': year,
            'Neighborhood': neighborhood,
            'Style': style,
            'Price': final_price
        })
        # Blank out Size_sqft in up to 50 rows (indices are drawn with replacement)
        df.loc[np.random.choice(n, 50), 'Size_sqft'] = np.nan
        return df

This is **perfect**. You have successfully created your first "Machine" in the factory.

* **Structure:** Your `__init__` correctly stores the settings.
* **Logic:** Your `get_data` correctly uses `self.n_samples` and `self.seed_val`.
* **Result:** This class is now a reusable tool. You can create `gen1 = HousingDataGenerator(n_samples=500)` and `gen2 = HousingDataGenerator(n_samples=5000)` without rewriting any logic.

---

### Class 2: The "Cleaner" (DataPreprocessor)

This is the hardest class, so we will take it slowly.

**The Goal:**
In your script version, you probably did `fit_transform` on the whole dataset or re-fitted on the test set. **That is dangerous.**
The `DataPreprocessor` class must act like a smart robot that:

1. **Learns** from the Training Data (calculates means, modes).
2. **Remembers** those values (stores them in `self`).
3. **Applies** exactly those values to the Test Data (no peeking!).


### Class: `DataPreprocessor`

**Goal:** Create a machine that learns transformation rules (imputing, scaling) from training data and applies them strictly to test data.

#### Step 1: The `__init__` Method

* **Action:** Initialize a variable (let's call it `self.preprocessor`) to `None`.
* **Why:** This variable will eventually hold your "Pipeline" object (the machine that does the actual cleaning). We start with `None` because we haven't built the machine yet.

#### Step 2: The `create_pipeline` Method

* **Action:** This is where you define your cleaning rules. You need to create a `ColumnTransformer`.
* **Instruction A (Numeric):** Create a pipeline for numeric columns (`Size_sqft`, `Year_Built`).
    * It should first Impute missing values (using 'mean').
    * It should then Scale values (StandardScaler).


* **Instruction B (Categorical):** Create a pipeline for categorical columns (`Neighborhood`, `Style`).
    * It should first Impute missing values (using 'most_frequent').
    * It should then One-Hot Encode (handle_unknown='ignore').


* **Instruction C (Combine):** Use `ColumnTransformer` to bundle these two pipelines together.

* **Critical Final Step:** Assign this `ColumnTransformer` object to `self.preprocessor`.

#### Step 3: The `process_and_split` Method

* **Input:** Accepts a raw DataFrame (`df`) and the name of the target column (`target_col`).
* **Logic:**
   1. **Check:** If `self.preprocessor` is still `None`, run `self.create_pipeline()` to build it.
   2. **Separate:** Split your DataFrame into Features (`X`) and Target (`y`).
   3. **Split:** Use `train_test_split` to create your 4 arrays: `X_train`, `X_test`, `y_train`, `y_test`.
   4. **The Golden Rule:**
      * On `X_train`: Run **`fit_transform`** using `self.preprocessor`. (This learns the means/standard deviations AND changes the data).
      * On `X_test`: Run **`transform`** using `self.preprocessor`. (This only changes the data using the *learned* means. It does NOT learn new ones).

* **Return:** The four processed arrays.

---

**Your Turn:**
Write the `DataPreprocessor` class based on these instructions. Focus on getting the `fit_transform` vs `transform` logic correct in Step 3. That distinction is a very common interview question in this field!

In [24]:
class DataPreprocessor:
    """
    Create a machine that learns transformation rules (imputing, scaling) from training data and applies them strictly to test data.
    """
    def __init__(self):
        self.preprocessor = None
        self.random_seed = 42
        self.test_size = 0.2

    
    def create_pipeline(self):
        """
        Here is where you define the RULES for cleaning.
        (Imputers, Scalers, OneHotEncoders)
        """
        num_cols = ["Size_sqft", "Year_Built"]
        cat_cols = ["Neighborhood", "Style"]
        
        num_pipeline = Pipeline([
            ## Name, Function
            ("impute", SimpleImputer(strategy = "mean")),
            ("scale", StandardScaler())
        ])

        cat_pipeline = Pipeline([
            ## Name, Function
            ("impute", SimpleImputer(strategy = "most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown='ignore'))
        ])

        transform = ColumnTransformer(
            transformers = [
                ## Name, Pipeline, Columns
                ("num", num_pipeline, num_cols),
                ("cat", cat_pipeline, cat_cols)
            ],
            remainder = "passthrough"
        )
        self.preprocessor = transform
        

    def test_train_fit_data(self, df, target_col = "Price"):
        """
        Splits df into training and test sets, fits the preprocessor on the
        training features only, then transforms both sets with it.
        """
        
        if self.preprocessor is None:
            self.create_pipeline()
            
        X = df.drop(columns = target_col)
        y = df[target_col]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = self.test_size, random_state = self.random_seed)
        
        X_train_processed = self.preprocessor.fit_transform(X_train) ## This learns the means/standard deviations AND changes the data
        X_test_processed = self.preprocessor.transform(X_test) ## This only changes the data using the learned means. It does NOT learn new ones
        
        return X_train_processed, X_test_processed, y_train, y_test



### Class 3: `ModelTrainer`

**Goal:** Initialize three different models, train them all, compare their scores, and save the winner.

#### Step 1: The `__init__` Method

* **Action:**
    * Create a dictionary called `self.models`.
        * **Keys:** Strings like "Linear", "RandomForest", "XGBoost".
        * **Values:** The actual model objects (e.g., `LinearRegression()`, `RandomForestRegressor(random_state=42)`, etc.).
    * Initialize `self.best_model` to `None`.


#### Step 2: The `train_and_evaluate` Method

* **Input:** Accepts all your data: `X_train`, `y_train`, `X_test`, `y_test`.
* **Logic (The Loop):**
    * Create a variable `best_rmse` and set it to `float('inf')` (infinity) so the first model always beats it.
    * **Loop** through your `self.models` dictionary.
    * For each model:
        1. **Fit** it on `X_train` and `y_train`.
        2. **Predict** on `X_test`.
        3. **Calculate RMSE** (Root Mean Squared Error).
        4. **Print** the result (e.g., `"Linear Regression RMSE: 10200"`).
        5. **Compare:** If this model's RMSE is lower than `best_rmse`:
            * Update `best_rmse`.
            * Save this model to `self.best_model`.
* **Final Output:** Print which model won and the final score.
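
If the RMSE step feels abstract, here is the arithmetic spelled out as a small sketch. The prices are made up, chosen only so the numbers are easy to follow.

```python
import numpy as np

# Made-up prices, chosen only so the arithmetic is easy to follow.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 320_000.0])

errors = y_true - y_pred              # [-10000, 10000, -20000]
mse = np.mean(errors ** 2)            # (1e8 + 1e8 + 4e8) / 3 = 2e8
rmse = np.sqrt(mse)                   # about 14142.14

print(f"RMSE: {rmse:.2f}")            # RMSE: 14142.14
```

Because the errors are squared before averaging, the one 20,000 miss counts for more than the two 10,000 misses combined. RMSE is in the same units as the target (here, price), which makes it easy to read.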

#### Optional (Bonus Step): Hyperparameters

* If you want to keep it simple, just use standard `.fit()`.
* If you want to be fancy, you can add a `GridSearchCV` inside the loop, but I recommend getting the simple loop working first before adding Grid Search back in.
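
For when you do come back to it, here is a rough sketch of what a grid search for one model could look like. It runs on synthetic stand-in data from `make_regression`, and the parameter grid is an arbitrary, deliberately tiny assumption, not a recommended setting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; in the project this would be the preprocessed arrays.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A deliberately tiny, illustrative grid so the search stays fast.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence "neg"
    cv=3,
)
search.fit(X_train, y_train)   # cross-validates on the training data only

best_model = search.best_estimator_   # already refit on all of X_train
print("Best params:", search.best_params_)
print("CV RMSE:", -search.best_score_)
```

Inside your loop, `search.best_estimator_` would take the place of the plain fitted model when you predict on `X_test` and compare RMSE values.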

---


In [27]:
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor 
from xgboost import XGBRegressor 
from sklearn.metrics import mean_squared_error
import numpy as np

class ModelTrainer():
    """
    Creates three different models, trains them all, compares their scores, and saves the winner.
    """

    def __init__(self):
        self.n_estimators = 100
        self.learning_rate = 0.1
        self.random_seed = 42
        self.best_rmse = float("inf")
        
        self.models = {
            "Linear": LinearRegression(),
            "RandomForest": RandomForestRegressor(random_state = self.random_seed),
            "XGBoost": XGBRegressor(n_estimators = self.n_estimators, 
                                    learning_rate = self.learning_rate, 
                                    random_state = self.random_seed)
        }
        self.best_model = None
        self.best_model_name = None
        

    def train_and_evaluate(self, X_train, y_train, X_test, y_test):
        """Fit every model, report its test RMSE, and remember the best one."""
        for name, model in self.models.items():
            print(f"Starting evaluation for {name}")
            model.fit(X_train, y_train)   # fit() trains the model in place
            y_pred = model.predict(X_test)
            rmse = np.sqrt(mean_squared_error(y_test, y_pred))
            if rmse < self.best_rmse:
                self.best_rmse = rmse
                self.best_model = model
                self.best_model_name = name
            print(f"Root Mean Squared Error = {rmse}\n==========================")

        print(f"The best model is {self.best_model_name} with an RMSE = {self.best_rmse}")

        return self.best_model_name, self.best_model


You have built:

1. **`HousingDataGenerator`**: Creates data.
2. **`DataPreprocessor`**: Cleans data.
3. **`ModelTrainer`**: Trains and compares models.

Now, we need the **Main Execution Block** to tie them all together. This is where you push the "Start" button on your factory.

In [28]:
if __name__ == "__main__":
    df = HousingDataGenerator().get_data()
    
    DP = DataPreprocessor()
    X_train, X_test, y_train, y_test = DP.test_train_fit_data(df)
    
    mt = ModelTrainer()
    
    best_model_name, best_model = mt.train_and_evaluate(X_train, y_train, X_test, y_test)
    

Starting evaluation for Linear
Root Mean Squared Error = 31316.214460850868
Starting evaluation for RandomForest
Root Mean Squared Error = 37729.77183190089
Starting evaluation for XGBoost
Root Mean Squared Error = 37646.25156837973
The best model is Linear with an RMSE = 31316.214460850868



### Class 4: The "Shipper" (ModelSaver)

A model is useless if it stays stuck in your Python script. In the real world, we need to **save** it to a file so we can send it to a website, an app, or a cloud server.

We call this "Serialization" (or "Pickling").

#### "Serialization" (Saving the Game)

Imagine you are playing a video game. If you turn off the console without saving, you lose all your progress.

   * **Running your Python script** is like playing the game. The model learns and gets smarter.
   * **Closing Python** is like turning off the console. The model "dies" and forgets everything.
   * **"Pickling" (Serialization)** is simply hitting the **"Save Game"** button. It freezes your smart model into a file (like `savefile.pkl`) so you can open it tomorrow, or send it to a friend, and it remembers everything it learned.


**Your Final Challenge:** Write the `ModelSaver` class.

#### The "Gotcha" (Critical Concept) ⚠️

Most beginners just save the model. **That is a mistake.**
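If you load the model later and try to predict a house with `Size=2000`, the prediction will be wildly wrong (and raw text columns such as `Neighborhood` will typically raise an error), because the model expects **scaled, encoded data** (e.g., roughly `Size=1.0`), not raw values.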

**You must save BOTH:**

1. The Trained Model (`best_model`)
2. The Fitted Preprocessor (`DP.preprocessor`)

**The Problem:** Your model DOES NOT understand real-world numbers anymore.
   * You taught it that a "big house" is `Size = 1.5` (scaled) and a "small house" is `Size = -0.5`.
   * If a user on your website types `Size = 2000`, the model will panic. It thinks `2000` is a gigantic, impossible number because it expects small, scaled numbers.

   * **The Preprocessor:** This is your **Translator**. It knows exactly how to turn "2000" into roughly "1.0", using the mean and standard deviation it learned from the training data.
   * **The Lesson:** If you save *only* the Model, you are saving a genius who speaks a secret language, but you forgot to pack their dictionary. You must save **both** the Model (the brain) and the Preprocessor (the translator) together.
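
Here is a minimal sketch of the problem on made-up, numeric-only data (three houses, one feature). In this simplified setting nothing raises an error; the prediction is simply nonsense when the translator is skipped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data: price is exactly 50,000 + 150 * size.
size_train = np.array([[1000.0], [1500.0], [2000.0]])
price_train = 50_000 + 150 * size_train.ravel()

scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(size_train), price_train)

new_house = np.array([[2000.0]])

# Correct: translate the raw size with the SAME fitted scaler, then predict.
good = model.predict(scaler.transform(new_house))[0]

# Wrong: feed the raw size straight in. The result is a nonsense price.
bad = model.predict(new_house)[0]

print(f"with preprocessor:    {good:,.0f}")   # 350,000
print(f"without preprocessor: {bad:,.0f}")    # far too large (over 100 million)
```

The model itself is identical in both calls. The only difference is whether the input went through the fitted scaler first.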


#### Instructions for Class 4: `ModelSaver`

1. **Library:** You will need `import joblib` (a common choice for saving scikit-learn models).
2. **`__init__`**: It doesn't strictly *need* anything, but you could pass a default folder name if you wanted. For now, keep it empty.
3. **`save(self, model, preprocessor, filename)`**:
    * Create a dictionary to bundle them together:

      ```python
      artifact = {
          "model": model,
          "preprocessor": preprocessor
      }
      ```

    * Use `joblib.dump(artifact, filename)` to save that dictionary to a file (usually ending in `.pkl` or `.joblib`).
    * Print a confirmation message ("Model saved to...").


In [36]:
import joblib

class ModelSaver:
    """
    Saves the Model and the Preprocessor to a single file.
    """
    def __init__(self, filename = r"Data\m1-l3-model.pkl"):
        self.filename = filename

    def save_model(self, model, preprocessor, model_name):
        # Bundle them together!
        # This is the crucial step: We save a DICTIONARY containing both.
        artifact = {
            "model": model,
            "model_name": model_name,
            "preprocessor": preprocessor
        }
        
        # 'dump' means 'save to file'
        joblib.dump(artifact, self.filename)
        print(f"Success! Model {model_name} and Preprocessor saved to {self.filename}")


In [37]:
saver = ModelSaver()
saver.save_model(best_model, DP.preprocessor, best_model_name)

Success! Model Linear and Preprocessor saved to Data\m1-l3-model.pkl
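
Saving is only half of the story, so here is a hedged sketch of the other half: loading the artifact and predicting on raw input. It is self-contained and uses a tiny made-up dataset, a simplified preprocessor, and a temporary file (`demo-model.pkl` is a hypothetical name), rather than the classes and file saved above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with the same column names as the housing example.
train = pd.DataFrame({
    "Size_sqft": [1000.0, 1500.0, 2000.0, 2500.0],
    "Neighborhood": ["Rural", "Suburb", "Downtown", "Suburb"],
    "Price": [200_000.0, 295_000.0, 400_000.0, 445_000.0],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Size_sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])
X_train = preprocessor.fit_transform(train.drop(columns=["Price"]))
model = LinearRegression().fit(X_train, train["Price"])

# Save both pieces in one artifact, mirroring the dictionary used above.
path = os.path.join(tempfile.mkdtemp(), "demo-model.pkl")  # hypothetical file name
joblib.dump({"model": model, "preprocessor": preprocessor}, path)

# --- Later, in a fresh session: load, translate, then predict ---
artifact = joblib.load(path)
new_house = pd.DataFrame({"Size_sqft": [1800.0], "Neighborhood": ["Suburb"]})

X_new = artifact["preprocessor"].transform(new_house)   # transform only, never fit
prediction = artifact["model"].predict(X_new)[0]
print(f"Predicted price: {prediction:,.0f}")
```

The loaded preprocessor is only ever asked to `transform`. Calling `fit` or `fit_transform` on new data at prediction time would throw away what it learned from the training set.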
