Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.


Design a pipeline that includes the following steps:
- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

# Feature Engineering Pipeline

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Define the pipeline for numerical features
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Scale the numerical columns
])

# Define the pipeline for categorical features
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent value
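    # One-hot encode the categorical columns; handle_unknown='ignore' keeps
    # transform() from raising on categories that appear only in the test set
    ('onehot', OneHotEncoder(handle_unknown='ignore'))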
])

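# X_train, X_test, y_train and y_test are assumed to come from an earlier
# train/test split of the dataset (not shown in this notebook).
# The column lists are not defined above either; as an assumption, they are
# inferred here from the dtypes of X_train, which must be a pandas DataFrame.
numerical_columns = X_train.select_dtypes(include='number').columns.tolist()
categorical_columns = X_train.select_dtypes(exclude='number').columns.tolist()

# Combine numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_columns),  # Numerical columns
    ('cat', categorical_pipeline, categorical_columns)  # Categorical columns
])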

# Final pipeline with feature selection and Random Forest Classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Preprocessing steps
    ('feature_selection', SelectFromModel(RandomForestClassifier())),  # Feature selection
    ('classifier', RandomForestClassifier())  # Random Forest Classifier
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate accuracy on the test dataset
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

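ModuleNotFoundError: No module named 'sklearn'

Note: this error means scikit-learn is not installed in the environment that ran the cell. It has to be installed (for example with `pip install scikit-learn`) before the cell can run.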
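Since the cell above depends on data that the notebook never loads, the following is a minimal, self-contained sketch of the same pipeline run end to end. The dataset is synthetic and exists only for illustration: the column names (`age`, `income`, `city`), the target rule, the split size and the `random_state` values are all assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only (all column names are hypothetical)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Binary target that depends mostly on age, plus some noise
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Knock out some values so the imputers have something to do
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "city"] = np.nan

numerical_columns = ["age", "income"]
categorical_columns = ["city"]

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Mean imputation followed by standardisation for numerical columns
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Most-frequent imputation followed by one-hot encoding for categorical columns
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, numerical_columns),
    ("cat", categorical_pipeline, categorical_columns),
])

# Preprocess, select features by random forest importance, then classify
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selection", SelectFromModel(RandomForestClassifier(random_state=0))),
    ("classifier", RandomForestClassifier(random_state=0)),
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)
```

Because the data is random, the printed accuracy says nothing about a real dataset. The sketch only shows that the steps fit together and that missing values in both column types are handled inside the pipeline.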

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

# Pipeline with Random Forest and Logistic Regression

In [2]:
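from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out 20% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define the individual classifiers
rf_classifier = RandomForestClassifier(random_state=42)
# max_iter is raised from its default so the solver converges on unscaled features
lr_classifier = LogisticRegression(max_iter=1000)

# Create a Voting Classifier combining the classifiers
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'
)

# Train the voting classifier
voting_classifier.fit(X_train, y_train)

# Evaluate accuracy on the test dataset
accuracy = accuracy_score(y_test, voting_classifier.predict(X_test))
print("Accuracy:", accuracy)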

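To make the `voting='hard'` setting concrete, here is a minimal pure-Python sketch of majority voting over per-model label predictions. The function name `majority_vote` and the example predictions are hypothetical. Ties are broken in favour of the smallest label, which is meant to mirror scikit-learn's documented behaviour of resolving ties by ascending class order. It is an illustration of the idea, not the library's implementation.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions by majority vote.

    predictions_per_model: list of equal-length lists, one list per model.
    Ties go to the smallest label (an assumption chosen to mirror the
    ascending-order tie-break that scikit-learn documents for hard voting).
    """
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Among the labels with the highest count, pick the smallest one
        combined.append(min(label for label, n in counts.items() if n == top))
    return combined


# Hypothetical predictions from two models on four samples
rf_preds = [0, 1, 2, 2]
lr_preds = [0, 1, 1, 2]
voted = majority_vote([rf_preds, lr_preds])
print(voted)  # [0, 1, 1, 2]
```

With only two estimators, every disagreement is a tie, so the combined prediction falls back to the tie-break rule (the third sample above). Adding a third classifier, or using `voting='soft'` to average predicted probabilities, avoids relying on that rule.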