## Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

### Design a pipeline that includes the following steps:
1. **Use an automated feature selection method to identify the important features in the dataset**
2. **Create a numerical pipeline that includes the following steps:**
   - Impute the missing values in the numerical columns using the mean of the column values.
   - Scale the numerical columns using standardization.
3. **Create a categorical pipeline that includes the following steps:**
   - Impute the missing values in the categorical columns using the most frequent value of the column.
   - One-hot encode the categorical columns.
4. **Combine the numerical and categorical pipelines using a ColumnTransformer**
5. **Use a Random Forest Classifier to build the final model**
6. **Evaluate the accuracy of the model on the test dataset**
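
The task statement mentions highly correlated features, which the steps above do not address directly. One possible way to handle them before the pipeline is to drop one column from each highly correlated pair. The sketch below is an illustration under assumptions: the helper `drop_highly_correlated`, the 0.9 threshold, and the toy columns are all hypothetical and not part of the original solution.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute Pearson correlation exceeds `threshold`.

    Hypothetical helper for illustration; only numeric columns are compared.
    """
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)


# Toy data (assumed): "a_copy" is nearly collinear with "a", "b" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
corr_demo = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + rng.normal(scale=0.01, size=100),
    "b": rng.normal(size=100),
})

reduced = drop_highly_correlated(corr_demo, threshold=0.9)
print(list(reduced.columns))
```

A filter like this is unsupervised, so it can run before the train/test split without leaking label information, although the threshold itself is a judgment call.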

### Code Implementation:

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel

# Load the Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Identify numerical and categorical features
numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Pclass', 'Sex', 'Embarked']

# Keep only the selected feature columns and the target
X = df[numerical_features + categorical_features]
y = df['Survived']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
])

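# Feature selection: keep features whose Random Forest importance is at least
# the mean importance (the default SelectFromModel threshold for this estimator)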
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# Final pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {accuracy:.4f}')

# Perform 5-fold cross-validation on the training set
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f'Cross-validation Accuracy: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}')


Model Accuracy: 0.7542
Cross-validation Accuracy: 0.7683 ± 0.0179


### Interpretation and Improvements:
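- The held-out test accuracy (0.7542) lies within one standard deviation of the 5-fold cross-validation estimate (0.7683 ± 0.0179), so the two evaluations roughly agree. To improve on this score, we can try: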
  - Tuning hyperparameters of the Random Forest Classifier.
  - Using different feature selection techniques.
  - Trying alternative imputations for missing values.
  - Experimenting with other classifiers like XGBoost.
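
The first and third suggestions above can be explored together with a grid search over the whole pipeline. The following is a minimal sketch, not a tuned setup: it runs on synthetic data so it works offline, and the column names, the grid values, and the choice of `cv=3` are assumptions. Nested parameters are addressed with scikit-learn's `step__substep__parameter` naming.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns).
rng = np.random.default_rng(42)
n = 150
toy = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], n),
})
toy.loc[::7, "num_a"] = np.nan  # inject some missing values
toy_y = ((toy["num_b"] > 0) | (toy["cat_a"] == "x")).astype(int).to_numpy()

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer()),
            ("scaler", StandardScaler()),
        ]), ["num_a", "num_b"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["cat_a"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Search over the numerical imputation strategy and two forest hyperparameters.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

search = GridSearchCV(toy_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy, toy_y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Because the imputer and scaler are inside the pipeline, each cross-validation fold refits them on its own training portion, which avoids leaking statistics from the validation fold.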

---

## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

### Code Implementation:


from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define classifiers
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
log_reg = LogisticRegression(max_iter=200)

# Create voting classifier
voting_clf = VotingClassifier(estimators=[("rf", rf_clf), ("lr", log_reg)], voting="hard")

# Train and evaluate model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Voting Classifier Accuracy: {accuracy:.4f}")


Voting Classifier Accuracy: 1.0000


### Interpretation and Improvements:
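- A hard voting classifier predicts the class chosen by the majority of its member models, which can offset the individual errors of each model. With only two estimators, any disagreement is a tie, which scikit-learn resolves in favor of the lower class label.
- The perfect score is measured on only 30 test samples, so it is worth confirming with cross-validation.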
- To further enhance performance:
  - Use soft voting, which averages the predicted class probabilities instead of counting votes.
  - Add more diverse models, such as SVM or KNN.
  - Perform hyperparameter tuning on individual classifiers.
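
The first two suggestions can be sketched as follows: soft voting over three models, evaluated with 5-fold cross-validation instead of a single split. The choice of KNN as the third model, the scaling wrappers, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Soft voting averages predict_proba outputs, so every member must provide it.
# Logistic regression and KNN are wrapped with a scaler; the forest does not need one.
soft_voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="soft",
)

# Cross-validation gives a steadier estimate than one 30-sample test split.
soft_scores = cross_val_score(soft_voting_clf, X_iris, y_iris, cv=5, scoring="accuracy")
print(f"Soft voting CV accuracy: {soft_scores.mean():.4f} ± {soft_scores.std():.4f}")
```

Using an odd number of estimators also removes the two-model tie case if `voting` is switched back to `"hard"`.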

Combining models in an ensemble can make predictions more robust than relying on a single classifier, although the size of the gain depends on how diverse the base models are.