Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.


ans.

We are building a pipeline that:

* Handles missing values
* Encodes categorical variables
* Scales numerical features
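* Reduces redundancy among correlated features (indirectly, via model-based feature selection rather than an explicit correlation filter)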
* Selects important features automatically
* Trains and evaluates a Random Forest classifier

We'll use `scikit-learn`'s pipeline tools.


###  **Step 1: Sample Dataset Creation**

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Sample dataset
df = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 45, 50, np.nan, 40],
    'salary': [50000, 60000, 55000, np.nan, 70000, 65000, 62000, 72000],
    'experience': [1, 3, 2, 4, 5, 6, 7, 8],
    'gender': ['M', 'F', 'F', np.nan, 'M', 'M', 'F', 'F'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

X = df.drop("purchased", axis=1)
y = df["purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


### **Step 2: Preprocessing Pipelines**

####  Numerical Pipeline:

* Impute missing values using **mean**
* Scale using **StandardScaler**

####  Categorical Pipeline:

* Impute missing values using the **most frequent** category
* Apply **OneHotEncoder**

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify column types
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])
```
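
The preprocessing above does not itself drop correlated columns. One way to handle multicollinearity explicitly is a small custom transformer that removes one column from each highly correlated pair. The sketch below is only an illustration: `CorrelationFilter`, its `threshold` parameter, and the toy data are hypothetical names introduced here, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drop one column from each highly correlated pair."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        corr = X.corr().abs()
        # Keep only the upper triangle so each pair is inspected once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        self.to_drop_ = [
            col for col in upper.columns if (upper[col] > self.threshold).any()
        ]
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        return X.drop(columns=self.to_drop_).to_numpy()


# Toy data (assumed for illustration): 'b' is an exact multiple of 'a',
# while 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
demo = pd.DataFrame({'a': a, 'b': 2 * a, 'c': rng.normal(size=50)})

filt = CorrelationFilter(threshold=0.9).fit(demo)
reduced = filt.transform(demo)

print(filt.to_drop_)   # columns flagged as redundant
print(reduced.shape)   # shape after dropping them
```

A transformer like this could sit between the `preprocessor` and `feature_selection` steps, assuming the preprocessor returns a dense array. The 0.9 threshold is an arbitrary starting point and would need tuning on real data.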


###  **Step 3: Add Feature Selection**

We'll use `SelectFromModel` with a `RandomForestClassifier` to select important features.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
```
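
To see what `SelectFromModel` does on its own, here is a minimal sketch on synthetic data. The dataset and the names `X_demo`, `y_demo`, and `selector` are assumptions made for illustration. With a Random Forest as the estimator, the default threshold is the mean feature importance, so columns at or above the mean are kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumed for illustration): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='mean',  # written out explicitly; this is the default here
)
X_selected = selector.fit_transform(X_demo, y_demo)

mask = selector.get_support()                            # boolean mask of kept columns
importances = selector.estimator_.feature_importances_  # one value per input column

print("Kept column indices:", mask.nonzero()[0])
print("Shape before/after:", X_demo.shape, X_selected.shape)
```

Inside a `Pipeline`, the selector is refit on each training fold, which keeps the selection from leaking information from the test data.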


###  **Step 4: Full Pipeline with Model**

```python
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
```



###  **Explanation of Steps**

| Step                     | Description                                                                                                  |
| ------------------------ | ------------------------------------------------------------------------------------------------------------ |
| **1. Imputation**        | Fills missing values with the mean (numerical) and the mode (categorical).                                   |
| **2. Scaling**           | Standardizes numerical features to zero mean and unit variance. Random Forests do not need this, but it keeps the pipeline reusable with scale-sensitive models. |
| **3. One-Hot Encoding**  | Converts categorical features into numeric indicator columns.                                                |
| **4. Feature Selection** | Keeps features whose Random Forest importance is at or above the mean importance (the `SelectFromModel` default). |
| **5. Classification**    | Trains a `RandomForestClassifier` on the selected features.                                                  |
| **6. Evaluation**        | Prints accuracy on the held-out test set as the performance metric.                                          |


###  **Interpretation of Results**

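Running the code above prints a single line such as:

```text
Model Accuracy: 1.00
```

With only eight rows, the test set holds just two samples, so the score can only be 0.00, 0.50, or 1.00. Treat it as a smoke test of the pipeline, not as a measure of model quality.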
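
A quick check of the split size shows why the score is so coarse. This sketch uses stand-in arrays with the same number of rows as the sample data; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with eight rows, mirroring the size of the sample dataset.
X_demo = np.arange(8).reshape(-1, 1)
y_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

_, X_te, _, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

n_test = len(y_te)
# Every achievable accuracy is (number of correct predictions) / n_test.
possible_accuracies = [k / n_test for k in range(n_test + 1)]

print(n_test)                # 2
print(possible_accuracies)   # [0.0, 0.5, 1.0]
```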


###  **Possible Improvements**

1. **Cross-validation**: Use `cross_val_score` for a more reliable accuracy estimate than a single train/test split.
2. **Hyperparameter tuning**: Use `GridSearchCV` to tune the `RandomForestClassifier` inside the pipeline.
3. **Outlier handling**: Add a step to remove or cap outliers.
4. **Feature interaction**: Try polynomial features or interaction terms.
5. **Larger data**: Evaluate the pipeline on a larger, real-world dataset; eight rows are too few to judge generalization.
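
The first two items can be sketched as follows. This is a minimal example on synthetic data, not a tuned setup: the dataset, the name `demo_pipeline`, and the grid values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a larger dataset (assumed for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validation: the whole pipeline is refit on each training fold.
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [None, 5],
}
search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```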



Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

##  Ensemble Voting Classifier Pipeline on Iris Dataset

###  **Goal**

* Use both **Random Forest** and **Logistic Regression** in a pipeline.
* Combine them using a **VotingClassifier**.
* Train on the **Iris dataset**.
* Evaluate the model accuracy.

---

##  **Theory**

| Component                    | Description                                                                                   |
| ---------------------------- | --------------------------------------------------------------------------------------------- |
| **Random Forest Classifier** | Ensemble of decision trees; good for capturing non-linear patterns.                           |
| **Logistic Regression**      | A simple, fast, and interpretable linear model.                                               |
| **Voting Classifier**        | Combines multiple models. Hard voting takes the majority class; soft voting averages predicted probabilities. |
| **Pipeline**                 | Ensures consistent preprocessing and model execution.                                         |
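
To make the two voting rules concrete, here is a small NumPy sketch. Hard voting counts one vote per classifier, while soft voting averages the predicted class probabilities. The probabilities below are made-up values for a single sample, chosen so that the two rules disagree.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.55, 0.45, 0.00],
    [0.60, 0.40, 0.00],
    [0.05, 0.95, 0.00],
])

# Hard voting: each classifier votes for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest.
soft_winner = probas.mean(axis=0).argmax()

print(hard_winner, soft_winner)  # 0 1
```

Two classifiers narrowly prefer class 0, so it wins the hard vote. The third is very confident in class 1, which tips the averaged probabilities the other way.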

---

##  **Code Implementation**

```python
# Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Step 2: Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 3: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create individual pipelines for models
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=200, random_state=42))
])

# Step 5: Create voting classifier
voting_clf = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='hard'  # majority vote; use 'soft' to average predicted probabilities
)

# Step 6: Train and predict
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)

# Step 7: Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Voting Classifier Accuracy on Iris Dataset: {accuracy:.2f}")
```



##  **Interpretation of Results**

| Metric   | Value                                           |
| -------- | ----------------------------------------------- |
| Accuracy | Typically about 0.93 to 1.00 across different train/test splits; with `random_state=42` the split, and so the score, is fixed |

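* **Random Forest** captures non-linear patterns.
* **Logistic Regression** contributes a simple linear decision boundary as a complementary view of the data.
* **VotingClassifier** combines the two. With only two estimators and `voting='hard'`, any disagreement is a tie, which scikit-learn resolves in favor of the lower class label, so a third model or soft voting is usually preferable.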



##  **Possible Improvements**

1. Use `voting='soft'`, which averages the predicted class probabilities instead of counting votes.
2. Tune hyperparameters with `GridSearchCV`.
3. Try other classifiers such as `KNeighborsClassifier`, `SVC`, or `GradientBoostingClassifier`.
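
As a sketch of the first and third suggestions, the example below switches to soft voting, adds a `KNeighborsClassifier` as a third estimator, and scores the ensemble with 5-fold cross-validation. The names `soft_clf`, `X_iris`, and `y_iris` and the choice of `n_neighbors=5` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

soft_clf = VotingClassifier(
    estimators=[
        # Tree ensembles do not need scaling, so no scaler here.
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))),
        ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting='soft',  # average predict_proba outputs instead of counting votes
)

# Cross-validation gives a steadier estimate than one train/test split.
scores = cross_val_score(soft_clf, X_iris, y_iris, cv=5, scoring='accuracy')
print(f"Soft-voting CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Soft voting requires every estimator to implement `predict_proba`. An `SVC`, for example, would need `probability=True`.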

