Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
Design a pipeline that includes the following steps:
 Use an automated feature selection method to identify the important features in the dataset.
 Create a numerical pipeline that includes the following steps:
   Impute the missing values in the numerical columns using the mean of the column values.
   Scale the numerical columns using standardisation.
 Create a categorical pipeline that includes the following steps:
   Impute the missing values in the categorical columns using the most frequent value of the column.
   One-hot encode the categorical columns.
 Combine the numerical and categorical pipelines using a ColumnTransformer.
 Use a Random Forest Classifier to build the final model.
 Evaluate the accuracy of the model on the test dataset.
Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.



Ans:
    
To design a pipeline for your machine learning project that includes automated feature selection,
handles missing values, and builds a Random Forest Classifier model,
you can use Python with scikit-learn and pandas. Here is
a step-by-step implementation of the pipeline:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load your dataset (replace 'data.csv' with your dataset file)
data = pd.read_csv('data.csv')

# Split the data into features (X) and target (y); replace 'target_column' with your label column
X = data.drop('target_column', axis=1)
y = data['target_column']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a numerical pipeline for numerical features
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Create a categorical pipeline for categorical features
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

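# Column lists for each branch, inferred here from the training data's dtypes
# (an assumption; list the column names explicitly if the dtypes are unreliable)
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

# Combine numerical and categorical pipelines using a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])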

# Create a feature selection step using SelectFromModel
# (random_state fixed so the selected features are reproducible)
feature_selection = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# Create the final pipeline with feature selection and Random Forest Classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Explanation of each step:

Load the dataset: Read the CSV file into a pandas DataFrame.
Split the data: Separate the features (X) from the target variable (y),
then hold out 20% of the rows as a test set.
Create numerical and categorical pipelines: Define separate pipelines for numerical
and categorical features. The numerical pipeline imputes missing values with the column mean and
standardizes the features. The categorical pipeline imputes missing values with the most
frequent value and one-hot encodes the categorical features.
Combine pipelines with ColumnTransformer: Use a ColumnTransformer to route the columns listed in
numerical_features and categorical_features to their respective pipelines.
Feature selection: Apply SelectFromModel with a RandomForestClassifier. With its default threshold,
it keeps the features whose importance is at least the mean importance. It runs after preprocessing,
so it selects among the scaled and one-hot encoded columns. It does not explicitly remove correlated
features; importance tends to be shared among them.
Create the final pipeline: Chain the preprocessing, the feature selection,
and a Random Forest Classifier.
Fit and evaluate: Fit the pipeline on the training data only and evaluate the model's
accuracy on the test data.


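Interpretation of Results:

The pipeline covers data preprocessing, feature selection, and model training.
It handles missing values, scales numerical features, and one-hot encodes categorical features.
Feature selection uses the feature importances of a Random Forest Classifier to decide which columns to keep.
The final model is a Random Forest Classifier trained on the selected columns.
The printed accuracy is the fraction of test rows that the model classifies correctly. Because every
preprocessing step is fitted on the training data only, this figure is not inflated by test-set leakage,
but it comes from a single split and can be misleading when the classes are imbalanced.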
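
To interpret the feature selection step, it helps to see which features were kept and why. The sketch below uses synthetic data from make_classification as a stand-in for the real dataset; the feature names f0 to f9 are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, of which the first 3 carry signal.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)

mask = selector.get_support()                           # boolean mask of kept features
importances = selector.estimator_.feature_importances_  # importances of the fitted forest

print("Threshold (mean importance):", round(selector.threshold_, 3))
for name, score, kept in zip(feature_names, importances, mask):
    print(f"{name}: importance={score:.3f} kept={kept}")
```

In the fitted pipeline from the answer, the same mask should be reachable through pipeline.named_steps['feature_selection'].get_support(), and the matching column names through pipeline.named_steps['preprocessor'].get_feature_names_out().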

Possible improvements:

Hyperparameter tuning: Optimize the hyperparameters of the Random Forest Classifier and other
components of the pipeline to achieve better performance.
Feature engineering: Experiment with different feature engineering techniques, such as 
creating new features, to improve model accuracy.
Cross-validation: Use cross-validation to assess the pipeline's performance
more robustly than a single train/test split and to detect overfitting.
Handling class imbalance: If the target classes are imbalanced, consider using techniques
like oversampling or undersampling to balance the dataset.
Model selection: Explore other machine learning algorithms and compare their performance
with the Random Forest Classifier to choose the best model for your task.
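
Two of the improvements above, cross-validation and hyperparameter tuning, can be sketched together. This example again uses synthetic data as a stand-in, and the parameter grid is an illustrative assumption rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)

pipeline = Pipeline([
    ("feature_selection",
     SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Cross-validation: five train/validation splits instead of a single one.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Hyperparameter tuning: step parameters are addressed as <step name>__<parameter>.
param_grid = {
    "feature_selection__threshold": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Because the whole pipeline is passed to the search, the feature selection is refitted inside every fold, so the validation folds never influence which features are kept.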

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.




Ans:
To build a pipeline that includes a Random Forest Classifier and a Logistic Regression Classifier,
and then use a Voting Classifier to combine their predictions on the Iris dataset,
you can follow these steps in Python using scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Create the Logistic Regression Classifier
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

# Create a Voting Classifier that combines both classifiers
voting_classifier = VotingClassifier(estimators=[
    ('random_forest', rf_classifier),
    ('logistic_regression', lr_classifier)
], voting='soft')  # 'soft' voting uses predicted probabilities for decision

# Create a pipeline
pipeline = Pipeline([
    ('voting_classifier', voting_classifier)
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the pipeline
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")



In this code:

We load the Iris dataset and hold out 30% of it as a test set.
We create a Random Forest Classifier and a Logistic Regression Classifier.
We create a Voting Classifier that combines both classifiers using "soft" voting,
which averages the predicted class probabilities of the two models and predicts the class with the highest average.
We wrap the Voting Classifier in a pipeline; it is the only step, so the pipeline behaves like the Voting Classifier itself.
We train the pipeline on the training data and make predictions on the test data.
Finally, we evaluate the accuracy of the pipeline on the test data.
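
To make the soft-voting rule concrete, the sketch below reproduces the ensemble's prediction by averaging the members' probabilities by hand, then compares each model with cross-validation. The hard-voting variant and the 5-fold setup are additions for comparison and are not part of the answer above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

estimators = [
    ("random_forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("logistic_regression", LogisticRegression(max_iter=1000, random_state=42)),
]

soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)

# Soft voting by hand: average the fitted members' class probabilities, then take the argmax.
member_probas = [member.predict_proba(X_test) for member in soft.estimators_]
manual_proba = np.mean(member_probas, axis=0)
manual_pred = soft.classes_[np.argmax(manual_proba, axis=1)]
matches = np.array_equal(manual_pred, soft.predict(X_test))
print("Manual average matches soft.predict:", matches)

# Cross-validated comparison of the two members and both voting modes.
models = dict(estimators)
models["soft_voting"] = VotingClassifier(estimators=estimators, voting="soft")
models["hard_voting"] = VotingClassifier(estimators=estimators, voting="hard")

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {results[name]:.3f}")
```

With only two members, hard voting can tie; scikit-learn then picks the class that comes first in sorted label order, which is one reason soft voting is often the more natural choice for an even number of models.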
