# Q.1

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Define the feature selection step
selector = SelectKBest(score_func=f_classif, k=10)

# Define the numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define the categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # ignore categories not seen during fit
])

# Combine the numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']),
    ('cat', cat_pipeline, ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
])

# Combine the feature selection and preprocessor using a pipeline
feature_engineering_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', selector)
])

# Define the final Random Forest Classifier pipeline
rf_pipeline = Pipeline([
    ('feature_engineering', feature_engineering_pipeline),
    ('classifier', RandomForestClassifier(random_state=42))  # fixed seed for reproducible results
])

# Split the dataset into training and testing data
# (X is assumed to be a DataFrame containing the columns listed above, y the target labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the pipeline on the training data
rf_pipeline.fit(X_train, y_train)

# Evaluate the pipeline on the test data
accuracy = rf_pipeline.score(X_test, y_test)

print("Accuracy: {:.2f}%".format(accuracy*100))


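In this pipeline, we first define a feature selection step using SelectKBest with the f_classif score function. This step keeps the k=10 columns with the highest ANOVA F-values, scoring each column independently against the target. Because the selector runs after the preprocessor, it ranks the scaled numeric columns and the individual one-hot columns, not the 13 original features.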
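
To make the scoring step concrete, here is a minimal sketch of SelectKBest with f_classif on its own. The data is synthetic (generated with make_classification as a stand-in for the real dataset), and the names X_demo, y_demo, and selector_demo are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 200 rows, 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

selector_demo = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector_demo.fit_transform(X_demo, y_demo)

# scores_ holds one F-statistic per input column;
# get_support() is a boolean mask marking the columns that were kept.
kept = np.flatnonzero(selector_demo.get_support())
top3_by_score = np.sort(np.argsort(selector_demo.scores_)[-3:])

print("F-scores:", np.round(selector_demo.scores_, 2))
print("Kept column indices:", kept)
print("Reduced shape:", X_reduced.shape)
```

The kept indices are the three columns with the largest F-scores, which is all the selector does: it does not account for interactions between features.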

Next, we define two separate pipelines for handling the numerical and categorical features. For the numerical pipeline, we use SimpleImputer to replace missing values with the mean of the column values, and StandardScaler to scale the features to have zero mean and unit variance. For the categorical pipeline, we use SimpleImputer to replace missing values with the most frequent value of the column, and OneHotEncoder to one-hot encode the categorical features.
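
As an illustration of what the two sub-pipelines do to a column, the sketch below runs them on a tiny made-up frame with one missing value each. The column values are invented for the example; only the pipeline steps mirror the ones described above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric and one categorical column, each with a gap.
num_demo = pd.DataFrame({"age": [40.0, 50.0, np.nan, 60.0]})
cat_demo = pd.DataFrame({"cp": [0.0, 1.0, np.nan, 1.0]})

num_demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# The missing age becomes the column mean (50), which scales to 0.
num_out = num_demo_pipeline.fit_transform(num_demo)

# The missing cp becomes the most frequent value (1.0) before encoding.
# OneHotEncoder returns a sparse matrix by default, hence toarray().
cat_out = cat_demo_pipeline.fit_transform(cat_demo).toarray()

print(num_out.ravel())
print(cat_out)
```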

We then combine the numerical and categorical pipelines using the ColumnTransformer, which applies the appropriate pipeline to each column of the input data.

Next, we chain the preprocessor and the feature selection step in another pipeline. This pipeline first applies the preprocessor to impute missing values, scale the numeric features, and encode the categorical features, and then keeps the 10 highest-scoring transformed columns using SelectKBest.

Finally, we define the final pipeline by combining the feature engineering pipeline with a Random Forest Classifier. The Random Forest Classifier uses an ensemble of decision trees to predict the target variable.

We then split the data into training and testing sets using train_test_split and train the pipeline on the training data. Finally, we evaluate the pipeline on the test data using the score method, which returns the accuracy of the model.
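
For a classifier, score is the mean accuracy on the given data, so it matches accuracy_score applied to the predictions. The following sketch checks this on synthetic data; the variable names and the use of stratify are choices made for the example, not part of the code above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# stratify keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

score_value = clf.score(X_te, y_te)
manual_value = accuracy_score(y_te, clf.predict(X_te))

print("score():", score_value)
print("accuracy_score():", manual_value)
```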

To improve the pipeline, we could tune hyperparameters such as the forest's n_estimators and max_depth, or the selector's k, using GridSearchCV or RandomizedSearchCV. We could also experiment with different feature selection methods, preprocessing steps, or classification algorithms.
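
A hedged sketch of such a search is shown below. Parameters of nested steps are addressed by joining step names with double underscores, so the selector's k inside the feature engineering pipeline becomes feature_engineering__selector__k. The data is synthetic and the grid values are arbitrary placeholders, not recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 150
toy = pd.DataFrame({
    "age": rng.normal(55, 9, n),
    "chol": rng.normal(240, 40, n),
    "cp": rng.integers(0, 4, n),
})
# Hypothetical target for the demo.
target = ((toy["age"] > 55) | (toy["cp"] == 3)).astype(int)

pipe_demo = Pipeline([
    ("feature_engineering", Pipeline([
        ("preprocessor", ColumnTransformer([
            ("num", StandardScaler(), ["age", "chol"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
        ])),
        ("selector", SelectKBest(score_func=f_classif, k=3)),
    ])),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the <step>__<substep>__<parameter> naming convention.
param_grid = {
    "feature_engineering__selector__k": [2, 4],
    "classifier__n_estimators": [25, 50],
    "classifier__max_depth": [3, None],
}

search = GridSearchCV(pipe_demo, param_grid, cv=3, scoring="accuracy")
search.fit(toy, target)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

Because the whole pipeline is refit inside each fold, the imputation, scaling, and feature selection are learned from the training part of the fold only, which avoids leaking information from the validation part.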

# Q.2

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

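# Column lists for the preprocessor (assumed to match the columns used in Q.1)
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

# Define the per-type transformers for preprocessing the data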
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Define the Logistic Regression Classifier
lr = LogisticRegression()

# Define the Voting Classifier
voting = VotingClassifier(estimators=[('rfc', rfc), ('lr', lr)], voting='hard')

# Define the pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('voting', voting)])

# Fit the pipeline on the training set
pipe.fit(X_train, y_train)

# Evaluate the accuracy of the pipeline on the test set
y_pred = pipe.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


In this pipeline, we first define a preprocessor that imputes missing values, scales the numeric features, and one-hot encodes the categorical features. We then define a RandomForestClassifier and a LogisticRegression classifier, and use a VotingClassifier to combine their predictions.

We fit the pipeline on the training set using fit(X_train, y_train) and evaluate its accuracy on the test set using predict(X_test) and accuracy_score(y_test, y_pred).

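Note that in the VotingClassifier, we set voting='hard', which means the final prediction is the majority vote over the classes predicted by the individual classifiers. With only two classifiers, every disagreement is a tie, and scikit-learn resolves ties by picking the class that comes first in ascending sort order. Adding a third estimator, or using voting='soft' (which averages the predicted probabilities), may therefore be preferable.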
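
The behaviour of hard voting when two estimators disagree can be checked with a deliberately artificial setup: two DummyClassifier instances that always predict opposite classes. This is only a sketch to expose the tie-breaking; the dummy estimators and data are not part of the pipeline above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

# Minimal made-up data with two classes.
X_demo = np.zeros((6, 1))
y_demo = np.array([0, 1, 0, 1, 0, 1])

always_0 = DummyClassifier(strategy="constant", constant=0)
always_1 = DummyClassifier(strategy="constant", constant=1)

# Same two estimators, listed in opposite order.
vote_a = VotingClassifier(
    estimators=[("zero", always_0), ("one", always_1)], voting="hard"
).fit(X_demo, y_demo)
vote_b = VotingClassifier(
    estimators=[("one", always_1), ("zero", always_0)], voting="hard"
).fit(X_demo, y_demo)

pred_a = vote_a.predict(X_demo)
pred_b = vote_b.predict(X_demo)

# Every row is a 1-1 tie; both ensembles resolve it to the lower class label.
print(pred_a)
print(pred_b)
```

The order of the estimators does not change the outcome: the tie goes to class 0 in both cases, so in a two-estimator hard vote the ensemble only predicts class 1 when both members agree on it.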

Possible improvements to this pipeline include tuning the hyperparameters of the individual classifiers (for example, n_estimators and max_depth for the forest, or C for the logistic regression) and of the VotingClassifier (voting and weights), using a different strategy for imputing missing values or encoding categorical features, or adding other types of classifiers to the ensemble.
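
As a rough sketch of how the ensemble itself could be tuned, the example below searches over voting, weights, and one parameter of a member estimator. Member parameters are reached through the estimator's name in the VotingClassifier (here lr__C), prefixed by the pipeline step name. The data is synthetic and the grid values are placeholders chosen for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

voting_demo = VotingClassifier(estimators=[
    ("rfc", RandomForestClassifier(n_estimators=25, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
])

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("voting", voting_demo),
])

param_grid = {
    "voting__voting": ["hard", "soft"],          # majority vote vs averaged probabilities
    "voting__weights": [None, [2, 1], [1, 2]],   # relative weight of (rfc, lr)
    "voting__lr__C": [0.1, 1.0],                 # regularisation strength of the member 'lr'
}

search = GridSearchCV(pipe_demo, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```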