
Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

* Use an automated feature selection method to identify the important features in the dataset.
* Create a numerical pipeline that includes the following steps:
  * Impute the missing values in the numerical columns using the mean of the column values.
  * Scale the numerical columns using standardisation.
* Create a categorical pipeline that includes the following steps:
  * Impute the missing values in the categorical columns using the most frequent value of the column.
  * One-hot encode the categorical columns.
* Combine the numerical and categorical pipelines using a ColumnTransformer.
* Use a Random Forest Classifier to build the final model.
* Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

Answer:

Here’s a complete pipeline for your machine learning project that addresses feature selection, preprocessing for numerical and categorical columns, and model building with a Random Forest Classifier. This solution also includes code snippets and explanations for each step.

Method-1:

In [1]:
#Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score
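
The cells that follow assume a DataFrame `df` with a `target` column already exists. For a dry run without real data, a small synthetic frame could be built as sketched here. Every column name, category, and value in it is hypothetical and chosen only so that the later steps have numerical and categorical columns with missing values to work on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: all names and values are invented for illustration
rng = np.random.default_rng(42)
n_rows = 200

df = pd.DataFrame({
    "age": rng.normal(40, 12, n_rows),
    "income": rng.normal(50_000, 15_000, n_rows),
    "city": rng.choice(["Delhi", "Mumbai", "Kolkata"], n_rows),
    "plan": rng.choice(["basic", "premium"], n_rows),
})

# A simple rule-based target so that some features carry signal
df["target"] = ((df["age"] > 40) & (df["plan"] == "premium")).astype(int)

# Blank out roughly 10% of one numerical and one categorical column
df.loc[rng.random(n_rows) < 0.1, "age"] = np.nan
df.loc[rng.random(n_rows) < 0.1, "city"] = np.nan

print(df.isna().sum())
```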


In [None]:
# Load Data and Split into Train and Test Sets
# Assuming `df` is your DataFrame, with 'target' as the target variable
X = df.drop(columns=['target'])
y = df['target']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
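# Automated Feature Selection
# Using a Random Forest to identify important features:

# A Random Forest cannot be fit on raw strings, and older scikit-learn versions
# also reject NaN, so build a crudely encoded copy used only for ranking features
X_train_fs = X_train.copy()
for col in X_train_fs.select_dtypes(include=['object', 'category']).columns:
    X_train_fs[col] = X_train_fs[col].astype('category').cat.codes  # missing -> -1
X_train_fs = X_train_fs.fillna(X_train_fs.median())

# Initialize a Random Forest model for feature selection
feature_selector = RandomForestClassifier(n_estimators=100, random_state=42)
feature_selector.fit(X_train_fs, y_train)

# Use SelectFromModel to retain important features
# (prefit=True because the forest above is already fitted)
sfm = SelectFromModel(feature_selector, threshold="mean", prefit=True)

# Get selected feature names and keep only those columns of the raw data
selected_features = X_train.columns[sfm.get_support()]
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]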


Explanation: This code trains a Random Forest model and uses the SelectFromModel method to select features based on importance. Only features with importance above the mean are retained. The feature selection reduces dimensionality, potentially improving model performance and interpretability.
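
To make the "importance above the mean" rule concrete, here is a small sketch on synthetic, fully numerical data. The data and the names `X_demo`, `y_demo`, `forest`, and `selector` are assumptions for illustration and are not part of the assignment dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with 3 informative features out of 8 (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, threshold="mean", prefit=True)

importances = forest.feature_importances_
mask = selector.get_support()

# With threshold="mean", a feature is kept when its importance
# is at least the mean importance across all features
print(np.round(importances, 3))
print(mask)
```

Printing the importances next to the mask is a quick way to check whether the mean is a sensible cut-off for a given dataset; `threshold` also accepts other strings such as `"median"` or a float.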

In [None]:
# Define Numerical and Categorical Pipelines
# a. Numerical Pipeline

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),   # Impute missing values with column mean
    ('scaler', StandardScaler())                   # Standardize features
])


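Explanation: The numerical_pipeline imputes missing values in the numerical columns using the column mean and standardizes the data, so that each feature has mean zero and unit variance. Scaling matters for models that are sensitive to feature magnitudes, such as linear models or k-nearest neighbours. A Random Forest splits on thresholds and is largely unaffected by it, so here the step is harmless but mainly keeps the pipeline reusable with other estimators.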
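
A minimal sketch of what the two numerical steps do to a single toy column is shown here. The numbers and the names `demo_numerical_pipeline` and `X_num` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Toy column with one missing value
X_num = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_num_scaled = demo_numerical_pipeline.fit_transform(X_num)

# The missing entry is replaced by the column mean (2.0),
# which maps to 0 after standardization
print(X_num_scaled.ravel())
print(X_num_scaled.mean(), X_num_scaled.std())
```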

In [None]:
# b. Categorical Pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # Impute missing values with mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # One-hot encode categorical columns
])


Explanation: The categorical_pipeline fills missing values in the categorical columns with the most frequent value and then one-hot encodes the categorical features. handle_unknown='ignore' prevents issues with new categories in the test set.
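
The effect of `handle_unknown='ignore'` can be seen on a toy column. This is a sketch with invented values; the names `demo_categorical_pipeline`, `train_colors`, and `unseen_colors` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up training column with one missing value
train_colors = pd.DataFrame(
    {"color": pd.Series(["red", "blue", np.nan, "red"], dtype=object)}
)
encoded_train = demo_categorical_pipeline.fit_transform(train_colors).toarray()

# "green" never appeared during fit, so its row is encoded as all zeros
unseen_colors = pd.DataFrame({"color": pd.Series(["green"], dtype=object)})
encoded_unseen = demo_categorical_pipeline.transform(unseen_colors).toarray()

print(encoded_train)   # columns follow the sorted categories: blue, red
print(encoded_unseen)
```

Without `handle_unknown='ignore'`, the second `transform` call would raise an error on the unseen category instead of returning zeros.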

In [None]:
# Combine Pipelines with ColumnTransformer

# Identify the numerical and categorical columns
# ('number' covers every integer and float width, not only int64/float64)
numerical_features = X_train[selected_features].select_dtypes(include='number').columns
categorical_features = X_train[selected_features].select_dtypes(include=['object', 'category']).columns

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)


Explanation: ColumnTransformer applies the numerical_pipeline to the numerical features and the categorical_pipeline to the categorical features. This modular design allows separate processing of feature types in one unified pipeline.
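
As a rough illustration of how the two branches are stitched together, the sketch here runs a ColumnTransformer on a hypothetical two-column frame (`toy`, with invented `age` and `city` values) and inspects the shape of the result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: one numerical and one categorical feature
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["Delhi", "Mumbai", np.nan, "Delhi"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; densify for display
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

# One scaled numerical column followed by one indicator column per city
print(toy_out.shape)  # (4, 3)
```

Columns not listed in any transformer are dropped by default; passing `remainder='passthrough'` keeps them unchanged.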

In [None]:
# Build the Final Pipeline and Train the Model

# Create a complete pipeline with preprocessing and model training
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the pipeline on selected features
model_pipeline.fit(X_train[selected_features], y_train)


In [None]:
# Evaluate the Model on the Test Set

# Predict on the test set
y_pred = model_pipeline.predict(X_test[selected_features])

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")


Explanation: We combine preprocessing and model training into a single pipeline. This pipeline is trained on the training data and then used to make predictions on the test set. Finally, the accuracy_score metric evaluates the model's performance on the test data.



Interpretation of Results and Potential Improvements
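Interpretation: The accuracy score is the fraction of test samples whose class was predicted correctly. It is easy to read but can be misleading when the classes are imbalanced, so it is worth comparing it with the share of the majority class and, where relevant, with precision, recall, or a confusion matrix. If the score is satisfactory for the task, the pipeline can be considered effective. Otherwise, consider the improvements below.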

Possible Improvements:

* Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize the RandomForestClassifier parameters, such as n_estimators and max_depth.
* Feature Engineering: Try creating new features or applying domain-specific transformations to improve model performance.
* Outlier Handling: Consider using robust scaling or removing outliers to reduce noise in the data.
* Model Selection: Explore other models like Gradient Boosting or XGBoost, which may perform better depending on the data characteristics.

This pipeline provides a reasonable starting point for automating feature engineering, handling missing values, and building a baseline model.
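
The hyperparameter tuning suggestion could look roughly like the sketch here. It uses synthetic data and a deliberately tiny grid so that it runs quickly; the names `demo_pipeline` and `param_grid` and the grid values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real training split
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# "<step name>__<parameter>" addresses a parameter of a step inside the pipeline
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [3, None],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Because the whole pipeline is passed to GridSearchCV, preprocessing is refit inside each fold, which avoids leaking information from the validation folds.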

Method -2:

To create a robust pipeline that addresses feature selection, missing values, and scaling, here’s a code example using Python with sklearn. This approach will help automate key steps in the machine learning workflow.

In [None]:
# Required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Sample data
# Assuming 'df' is your dataset with features and 'target' is the target column.
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numerical and categorical columns
num_features = X_train.select_dtypes(include='number').columns
cat_features = X_train.select_dtypes(include=['object', 'category']).columns

# 1. Numerical Pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Standardize numerical values
])

# 2. Categorical Pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute with most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical variables
])

# 3. Combine pipelines with ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

# 4. Automated Feature Selection
# SelectFromModel runs inside the pipeline, after preprocessing, because a
# Random Forest cannot be fit on raw strings. It therefore ranks the imputed,
# one-hot encoded columns and is fitted on the training data only.
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# 5. Full Pipeline with RandomForestClassifier
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('select', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')


Explanation:

Feature Selection:

 SelectFromModel uses a preliminary RandomForestClassifier to retain only important features, reducing dimensionality and possibly improving model performance.

Numerical Pipeline:

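This pipeline imputes missing numerical values using the column mean and scales the features to have mean zero and unit variance. Scaling puts all features on a comparable footing, which reduces the influence of raw feature magnitudes on scale-sensitive models. A Random Forest is largely unaffected by it, so the step mainly keeps the pipeline reusable with other estimators.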

Categorical Pipeline:

Categorical values are imputed with the most frequent category, and one-hot encoding transforms these columns into a format suitable for machine learning models.

ColumnTransformer:

 Combines numerical and categorical preprocessing pipelines, handling them separately before feeding into the classifier.

RandomForestClassifier:

The selected model for classification is a robust choice for handling complex datasets and helps in interpreting feature importance.

Interpretation of Results and Improvements:

The pipeline outputs model accuracy, giving insight into overall performance. To improve accuracy, consider:

Hyperparameter Tuning:

Use GridSearchCV or RandomizedSearchCV for optimal hyperparameters in RandomForestClassifier.

Feature Engineering:

Explore polynomial or interaction terms for possible feature enhancement.

Outlier Handling:

Analyze outliers as they can affect the model’s accuracy, particularly if they lead to skewed mean imputation.

Together, these steps give a streamlined, automated workflow that handles missing values, scales features, and trains a classifier in a single fit call.
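
The interaction terms mentioned under Feature Engineering could be generated with scikit-learn's PolynomialFeatures. This is a minimal sketch on a single made-up sample; the names `X_pair` and `interactions` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up numerical features for a single sample
X_pair = np.array([[2.0, 3.0]])

# interaction_only=True adds products of distinct features but no squared terms;
# include_bias=False omits the constant column of ones
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interactions.fit_transform(X_pair)

print(X_inter)  # [[2. 3. 6.]]
```

Such a step would typically sit inside the numerical pipeline, after imputation, so that products are never computed on missing values.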

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

Answer:

To build a pipeline with both a Random Forest and a Logistic Regression classifier, we can use a VotingClassifier to combine their predictions for a final output. Here’s the implementation for this on the Iris dataset:

In [3]:
# Required libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize individual classifiers
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
lr_clf = LogisticRegression(max_iter=200, random_state=42)

# Combine classifiers in a VotingClassifier
voting_clf = VotingClassifier(
    estimators=[('rf', rf_clf), ('lr', lr_clf)],
    voting='soft'  # Soft voting uses the predicted probabilities for averaging
)

# Create a pipeline with the voting classifier
pipeline = Pipeline([
    ('voting', voting_clf)
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Voting Classifier accuracy on Iris dataset: {accuracy:.2f}')


Voting Classifier accuracy on Iris dataset: 1.00


Explanation:

Load and Split Data: The Iris dataset is loaded, and we split it into training and test sets.

Initialize Classifiers:

We initialize both the RandomForestClassifier and LogisticRegression. Logistic Regression is given max_iter=200, above the scikit-learn default of 100, to give the solver more room to converge on unscaled features.

Voting Classifier:

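The VotingClassifier combines predictions from both classifiers. Here, we use soft voting, which averages the predicted class probabilities and picks the class with the highest average. This often works better than hard (majority) voting when the base classifiers produce reasonably well-calibrated probabilities.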

Pipeline:

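A pipeline wraps the voting classifier. With a single step it behaves like the bare classifier, but it makes the model easy to extend, for example by adding a scaling step before the ensemble.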

Train and Evaluate:

 We train the pipeline and evaluate its accuracy on the test set.
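
To illustrate what soft voting computes, the sketch here refits the same ensemble on the full Iris data and reproduces its output by averaging the base estimators' probabilities by hand. The names `soft_vote`, `manual_proba`, and `manual_pred` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

soft_vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=200, random_state=42)),
    ],
    voting="soft",
).fit(X_iris, y_iris)

# Average the class probabilities of the fitted base estimators by hand
manual_proba = np.mean(
    [est.predict_proba(X_iris) for est in soft_vote.estimators_], axis=0
)
# The predicted class is the one with the highest averaged probability
manual_pred = soft_vote.classes_[np.argmax(manual_proba, axis=1)]

print(np.allclose(manual_proba, soft_vote.predict_proba(X_iris)))
print(np.array_equal(manual_pred, soft_vote.predict(X_iris)))
```

With `voting='hard'`, each classifier would instead cast one vote for its predicted label, and `predict_proba` would not be available on the ensemble.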

Interpretation:

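The accuracy printed at the end indicates the model's performance on the test set. In the run above it is 1.00, but with test_size=0.2 the test set holds only 30 of the 150 Iris samples, so a single split says little about how stable that score is; cross-validation gives a more reliable estimate. The VotingClassifier often benefits from combining the strengths of both classifiers (e.g., Random Forest for non-linear relationships and Logistic Regression for a simple linear decision boundary). For further improvement, consider fine-tuning hyperparameters or adding additional classifiers to the ensemble.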
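
Since a single train/test split on 150 samples is noisy, a cross-validated estimate may be more informative. This is a sketch using 5 folds; the fold count and the name `cv_voting_clf` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

cv_voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=200, random_state=42)),
    ],
    voting="soft",
)

# For classifiers, an integer cv uses stratified folds,
# so each fold keeps the class proportions of the full data
cv_scores = cross_val_score(cv_voting_clf, X_iris, y_iris, cv=5, scoring="accuracy")

print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.2f} (std {cv_scores.std():.2f})")
```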
