
### Assignment Questions:

#### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

**Design a pipeline that includes the following steps:**

1. **Use an automated feature selection method to identify the important features in the dataset**  
   - Use `SelectKBest` or `RFE` for feature selection.
   
   ```python
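   from sklearn.feature_selection import SelectKBest, f_classif

   # f_classif requires numeric input with no missing values, so X here must
   # already be imputed and encoded; k must not exceed the number of columns.
   selector = SelectKBest(f_classif, k=10)  # keep the 10 highest-scoring features
   X_new = selector.fit_transform(X, y)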
   ```

2. **Create a numerical pipeline that includes the following steps:**
   - **Impute the missing values in the numerical columns using the mean of the column values.**
   - **Scale the numerical columns using standardization.**

   ```python
   from sklearn.pipeline import Pipeline
   from sklearn.impute import SimpleImputer
   from sklearn.preprocessing import StandardScaler

   numerical_pipeline = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='mean')),
       ('scaler', StandardScaler())
   ])
   ```

3. **Create a categorical pipeline that includes the following steps:**
   - **Impute the missing values in the categorical columns using the most frequent value of the column.**
   - **One-hot encode the categorical columns.**

   ```python
   from sklearn.preprocessing import OneHotEncoder

   categorical_pipeline = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='most_frequent')),
       ('onehot', OneHotEncoder(handle_unknown='ignore'))
   ])
   ```

4. **Combine the numerical and categorical pipelines using a `ColumnTransformer`.**

   ```python
   from sklearn.compose import ColumnTransformer

   preprocessor = ColumnTransformer(transformers=[
       ('num', numerical_pipeline, numerical_features),
       ('cat', categorical_pipeline, categorical_features)
   ])
   ```

5. **Use a Random Forest Classifier to build the final model.**

   ```python
   from sklearn.ensemble import RandomForestClassifier
   model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())])
   ```

6. **Evaluate the accuracy of the model on the test dataset.**

   ```python
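   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   # Hold out part of the data for testing (the 30% split size is a choice, not a requirement)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

   model.fit(X_train, y_train)      # imputers, scaler and encoder are fit on training rows only
   y_pred = model.predict(X_test)

   accuracy = accuracy_score(y_test, y_pred)
   print(f"Accuracy: {accuracy:.2f}")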
   ```
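
The steps above can be assembled into one runnable script. The following is a minimal sketch, not a definitive solution: the toy DataFrame, its column names (`age`, `income`, `city`), the target rule, and `k=3` are all hypothetical stand-ins for the real dataset. It places `SelectKBest` after the preprocessor inside the pipeline, so the selector only ever sees imputed, encoded training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Hypothetical target: mostly determined by age, plus noise.
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Inject missing values so the imputers have something to do.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "city"] = np.nan

numerical_features = ["age", "income"]
categorical_features = ["city"]

numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_pipeline, numerical_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Feature selection runs on the preprocessed matrix (2 numeric + 3 one-hot columns).
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(f_classif, k=3)),
    ("classifier", RandomForestClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42, stratify=y
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```

Because every step lives inside `model`, a single `fit` call learns the imputation values, scaling statistics, encoder categories, and selected features from the training rows alone, which avoids leaking test information into preprocessing.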

**Interpretation & Possible Improvements:**
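- The pipeline automates imputation, scaling, and one-hot encoding, so the same preprocessing is applied consistently to the training and test data. The feature selection in step 1 runs outside the pipeline unless the selector is added as a pipeline step.
- Possible improvements: try other models, tune hyperparameters with `GridSearchCV`, and address multicollinearity by dropping one feature from each highly correlated pair.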
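
Two of these improvements can be sketched together. The example below is illustrative only: `drop_correlated` is a hypothetical helper (not a scikit-learn function), the 0.9 threshold and the parameter grid are arbitrary choices, and the data is synthetic with one deliberately redundant column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def drop_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Hypothetical numeric data: four independent features plus a rescaled copy of f0.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
X["f0_copy"] = X["f0"] * 2.0 + 0.01
y = (X["f0"] + X["f1"] > 0).astype(int)

X_reduced, dropped = drop_correlated(X, threshold=0.9)
print("Dropped columns:", dropped)

# Small, arbitrary grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_reduced, y)
print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

When the classifier sits inside a `Pipeline`, the grid keys take the step name as a prefix, for example `classifier__n_estimators` for a step named `classifier`.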

#### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

**Answer:**

1. **Load the Iris dataset**:

   ```python
   from sklearn.datasets import load_iris
   data = load_iris()
   X, y = data.data, data.target
   ```

2. **Build pipelines for Random Forest and Logistic Regression**:

   ```python
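   from sklearn.ensemble import RandomForestClassifier
   from sklearn.linear_model import LogisticRegression
   from sklearn.pipeline import Pipeline
   from sklearn.preprocessing import StandardScaler

   # Random forests do not need feature scaling
   rf_pipeline = Pipeline([('classifier', RandomForestClassifier(random_state=42))])
   # Scaling helps logistic regression converge; max_iter is raised from the default of 100
   lr_pipeline = Pipeline([('scaler', StandardScaler()),
                           ('classifier', LogisticRegression(max_iter=1000))])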
   ```

3. **Combine the classifiers using a Voting Classifier**:

   ```python
   from sklearn.ensemble import VotingClassifier

   # Reuse the pipelines from step 2 so each model keeps its own preprocessing
   voting_clf = VotingClassifier(estimators=[
       ('rf', rf_pipeline),
       ('lr', lr_pipeline)
   ], voting='hard')
   ```

4. **Train and evaluate the pipeline**:

   ```python
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
   
   voting_clf.fit(X_train, y_train)
   y_pred = voting_clf.predict(X_test)
   
   accuracy = accuracy_score(y_test, y_pred)
   print(f"Accuracy: {accuracy:.2f}")
   ```
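
A single 70/30 split of Iris leaves only 45 test samples, so the reported accuracy can shift noticeably with the random seed. As a hedged complement to the steps above, the sketch below uses 5-fold cross-validation to compare the ensemble with its two members; the fold count, `random_state`, and `max_iter` values are assumptions, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(random_state=42)
lr = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
voting = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="hard")

# cross_val_score clones each estimator, so the three runs do not share fitted state.
scores = {}
for name, estimator in [("random forest", rf),
                        ("logistic regression", lr),
                        ("voting", voting)]:
    cv_scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Comparing the three means shows whether the ensemble actually adds anything on this dataset, rather than assuming it does.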

**Interpretation**:
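- The Voting Classifier combines the predictions of both models; with `voting='hard'` it returns the majority class label. With only two estimators, any disagreement is a tie, which scikit-learn resolves by choosing the class that comes first in ascending label order, so the ensemble is not guaranteed to outperform either model alone.
- **Improvements**: Try `soft` voting, which averages the predicted class probabilities, and add more diverse classifiers such as SVM or KNN.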
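
The soft-voting variant can be sketched as follows. This is one possible setup, assuming KNN as the extra classifier with `n_neighbors=5` and a stratified split; an SVM could be added the same way, but `SVC` only exposes `predict_proba` when constructed with `probability=True`, which soft voting requires.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Soft voting averages predict_proba outputs, so every estimator must provide them.
soft_clf = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(random_state=42)),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
], voting="soft")

soft_clf.fit(X_train, y_train)
proba = soft_clf.predict_proba(X_test)  # averaged class probabilities, one row per sample
soft_accuracy = accuracy_score(y_test, soft_clf.predict(X_test))
print(f"Soft-voting accuracy: {soft_accuracy:.2f}")
```

Using an odd number of estimators also reduces the chance of ties if the ensemble is switched back to hard voting.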