
### Q1. Designing a Pipeline for Automated Feature Engineering

**Step-by-Step Pipeline Design**:

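1. **Automated Feature Selection**:
   - Use a feature selection method like `SelectFromModel` with a `RandomForestClassifier` to identify important features. In the pipeline this step runs after preprocessing, because the random forest needs numeric, encoded input.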

2. **Numerical Pipeline**:
   - **Imputation**: Handle missing values in numerical columns by replacing them with the mean value.
   - **Scaling**: Standardize numerical columns using `StandardScaler`.

3. **Categorical Pipeline**:
   - **Imputation**: Handle missing values in categorical columns by replacing them with the most frequent value.
   - **Encoding**: One-hot encode the categorical columns.

4. **Combining Pipelines**:
   - Use `ColumnTransformer` to combine numerical and categorical pipelines.

5. **Modeling**:
   - Use a `RandomForestClassifier` to build the final model.

6. **Evaluation**:
   - Evaluate the accuracy of the model on the test dataset.

**Code Implementation**:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Sample dataset loading (replace with actual dataset)
data = pd.read_csv('your_dataset.csv')

# Splitting features and target
X = data.drop('target', axis=1)
y = data['target']

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

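# Identifying numerical and categorical columns.
# 'number' covers every numeric dtype (int32, float32, ...), so no numeric
# column is silently dropped by the ColumnTransformer.
numerical_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns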

# Numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ]
)

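# Feature selection step: keeps the features the random forest ranks as important
feature_selection = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42)
)

# Final pipeline: preprocessing -> feature selection -> Random Forest Classifier
# (random_state is fixed so that repeated runs give the same result)
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])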

# Training the model
model_pipeline.fit(X_train, y_train)

# Predicting on the test set
y_pred = model_pipeline.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

**Explanation**:
- **Feature Selection**: `SelectFromModel` fits a `RandomForestClassifier` on the preprocessed data and keeps the features whose importance is at or above a threshold (by default, the mean importance).
- **Numerical Pipeline**: Imputes missing values with the mean and scales the data.
- **Categorical Pipeline**: Imputes missing values with the most frequent value and one-hot encodes the data.
- **Combining Pipelines**: `ColumnTransformer` combines the numerical and categorical pipelines.
- **Random Forest Classifier**: The final model is built using `RandomForestClassifier`.
- **Evaluation**: The model's accuracy is calculated on the test set.

**Possible Improvements**:
- Tune hyperparameters using `GridSearchCV` or `RandomizedSearchCV`.
- Explore different imputation techniques and scaling methods, for example median imputation or `MinMaxScaler`.
- Try different models and ensemble techniques, such as the `VotingClassifier` used in Q2.
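
As an illustration of the first two suggestions, the sketch below tunes a pipeline with `GridSearchCV`. It is only a rough example: the data comes from `make_classification` as a stand-in for the real dataset, and the grid values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

tuning_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the '<step name>__<parameter name>' convention.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(tuning_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_te, y_te):.3f}")
```

Because the whole pipeline is passed to the search, the imputer and scaler are refitted inside each cross-validation fold, so no information from the validation fold leaks into preprocessing.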

### Q2. Building a Pipeline with Voting Classifier

**Code Implementation**:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset loading (replace with actual dataset)
data = pd.read_csv('iris.csv')

# Splitting features and target
X = data.drop('species', axis=1)
y = data['species']

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

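# Identifying numerical and categorical columns.
# 'number' covers every numeric dtype (int32, float32, ...), so no numeric
# column is silently dropped by the ColumnTransformer.
numerical_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns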

# Numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ]
)

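# Creating individual classifiers
# (random_state makes the forest reproducible; a larger max_iter gives the
# logistic regression solver more room to converge)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(max_iter=1000)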

# Voting Classifier
voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_classifier),
    ('lr', lr_classifier)
], voting='soft')

# Final pipeline with Voting Classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', voting_classifier)
])

# Training the model
model_pipeline.fit(X_train, y_train)

# Predicting on the test set
y_pred = model_pipeline.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

**Explanation**:
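- **Voting Classifier**: Combines `RandomForestClassifier` and `LogisticRegression` using soft voting: the class probabilities predicted by each model are averaged, and the class with the highest average probability is chosen. Both models must therefore support `predict_proba`.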
- **Numerical and Categorical Pipelines**: Same as in Q1.
- **Evaluation**: The model's accuracy is calculated on the test set.
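
The soft voting rule can be checked by hand. The sketch below is an illustration under assumptions rather than part of the original solution: it loads Iris through `load_iris` instead of `iris.csv`, skips the preprocessing pipeline, and compares a manual average of the two models' probabilities with what the `VotingClassifier` returns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in copy of Iris (an assumption; the solution above reads 'iris.csv').
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

soft_voter = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="soft")
soft_voter.fit(X_tr, y_tr)

# Class probabilities from each fitted base estimator.
rf_proba = soft_voter.named_estimators_["rf"].predict_proba(X_te)
lr_proba = soft_voter.named_estimators_["lr"].predict_proba(X_te)

# Soft voting with equal weights: average the probabilities, then take the argmax.
manual_proba = (rf_proba + lr_proba) / 2
manual_pred = soft_voter.classes_[np.argmax(manual_proba, axis=1)]

matches_proba = np.allclose(manual_proba, soft_voter.predict_proba(X_te))
matches_pred = np.array_equal(manual_pred, soft_voter.predict(X_te))
print("Averaged probabilities match:", matches_proba)
print("Predictions match:", matches_pred)
```

With `voting='hard'` the ensemble would instead take a majority vote over the predicted labels, and `predict_proba` would not be available on the ensemble.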

Together, Q1 and Q2 show how a scikit-learn pipeline keeps preprocessing, feature selection, and modeling in a single object that is fitted on the training data and then applied unchanged to the test data.