Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
- Design a pipeline that includes the following steps:
  - Use an automated feature selection method to identify the important features in the dataset.
  - Create a numerical pipeline that includes the following steps:
    - Impute the missing values in the numerical columns using the mean of the column values.
    - Scale the numerical columns using standardization.
  - Create a categorical pipeline that includes the following steps:
    - Impute the missing values in the categorical columns using the most frequent value of the column.
    - One-hot encode the categorical columns.
  - Combine the numerical and categorical pipelines using a ColumnTransformer.
  - Use a Random Forest Classifier to build the final model.
  - Evaluate the accuracy of the model on the test dataset.

In [21]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
# Create a synthetic dataset: two columns stay numerical, two are binarized into categories
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=['numerical_col_1', 'numerical_col_2', 'categorical_col_1', 'categorical_col_2'])
X['categorical_col_1'] = X['categorical_col_1'].apply(lambda x: 'A' if x > 0 else 'B')
X['categorical_col_2'] = X['categorical_col_2'].apply(lambda x: 'C' if x > 0 else 'D')

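# Set roughly 10% of all cells to NaN to simulate missing values (unseeded, so results vary per run)
mask = np.random.rand(*X.shape) < 0.1
X[mask] = np.nan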

In [17]:
X.head()

   numerical_col_1  numerical_col_2  categorical_col_1  categorical_col_2
0              NaN        -0.170122                  A                  C
1              NaN        -0.399569                  A                NaN
2          0.40713        -0.628558                  B                  D
3         0.064786          0.94486                NaN                  C
4        -1.402331        -0.307068                  B                  C


In [8]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler,OneHotEncoder

In [22]:
# Segregating numerical and categorical columns
categorical_cols=['categorical_col_1','categorical_col_2']
numerical_cols=['numerical_col_1','numerical_col_2']

In [10]:
# Pipeline for numerical features: impute missing values, then standardize
num_pipeline=Pipeline(
    steps=[
    ('imputer',SimpleImputer(strategy='mean')), # Replace missing values with the column mean
    ('scaler',StandardScaler()) # Standardize to zero mean and unit variance
    ]
)

In [11]:
# Pipeline for categorical features: impute with the most frequent value, then one-hot encode
cat_pipeline=Pipeline(
    steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('one_hot_encoder',OneHotEncoder())
    ]
)

In [23]:
# Combine both pipelines into a single preprocessor
preprocessor=ColumnTransformer([
    ('num_pipeline',num_pipeline,numerical_cols),
    ('cat_pipeline',cat_pipeline,categorical_cols)
])

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.20, random_state=42)

In [15]:
X_train=preprocessor.fit_transform(X_train)
X_test=preprocessor.transform(X_test)

In [95]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,precision_score,f1_score,classification_report

In [20]:
clf=RandomForestClassifier()
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(f"Model accuracy: {accuracy_score(y_test,y_pred)}")

Model accuracy: 0.825
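The cells above do not cover the automated feature selection step listed in the question. One possible way to add it, offered as a sketch and not as the notebook's own method, is to place `SelectFromModel` between the preprocessor and the classifier in a single `Pipeline`. The `*_demo` names, the seeded random generator, and `threshold='median'` are assumptions chosen for illustration. Selection here runs on the preprocessed (scaled and one-hot encoded) columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['numerical_col_1', 'numerical_col_2']
cat_cols = ['categorical_col_1', 'categorical_col_2']

# Rebuild the synthetic data; the seeded generator is an assumption for reproducibility
X_raw, y_demo = make_classification(n_samples=1000, n_features=4, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_raw, columns=num_cols + cat_cols)
X_demo['categorical_col_1'] = X_demo['categorical_col_1'].apply(lambda v: 'A' if v > 0 else 'B')
X_demo['categorical_col_2'] = X_demo['categorical_col_2'].apply(lambda v: 'C' if v > 0 else 'D')
rng = np.random.default_rng(0)
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.1)  # about 10% of cells become NaN

preprocessor_demo = ColumnTransformer([
    ('num_pipeline', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), num_cols),
    ('cat_pipeline', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor_demo),
    # Keep features whose random-forest importance is at least the median importance
    ('selector', SelectFromModel(RandomForestClassifier(random_state=0), threshold='median')),
    ('classifier', RandomForestClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
full_pipeline.fit(X_tr, y_tr)

demo_accuracy = full_pipeline.score(X_te, y_te)
n_selected = int(full_pipeline.named_steps['selector'].get_support().sum())
print(f"Selected features: {n_selected}, accuracy: {demo_accuracy}")
```

Keeping every step inside one pipeline means imputation, scaling, encoding, and selection are all fitted on the training split only, and the same fitted steps are then applied to the test split.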


Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

In [36]:
import seaborn as sns

In [37]:
df=sns.load_dataset('iris')

In [52]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
df['species']=encoder.fit_transform(df['species'])

In [54]:
df.head()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0


In [99]:
X=df.iloc[:,:-1]
y=df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [100]:
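from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
import warnings
# Silence all library warnings to keep the notebook output clean
warnings.filterwarnings('ignore')
# Base estimators for the voting ensemble
rfc=RandomForestClassifier()
lgr=LogisticRegression()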

In [101]:
vc=VotingClassifier(estimators=[('rfc',rfc),('lgr',lgr)],voting='soft')

In [102]:
pipeline=Pipeline([
    ('model',vc)
])

In [103]:
pipeline.fit(X_train,y_train)

In [104]:
y_pred_2=pipeline.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test,y_pred_2)}")

Accuracy score: 0.9777777777777777


In [105]:
print(f"Classification report:\n {classification_report(y_test,y_pred_2)}")

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
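With `voting='soft'`, as used above, the ensemble averages the class probabilities of its fitted estimators and predicts the class with the highest average. The sketch below reproduces that by hand and compares it with the ensemble's own prediction. It is an illustration under stated assumptions: it loads iris through `sklearn.datasets.load_iris` in place of seaborn, fixes `random_state`, raises `max_iter` for the logistic regression, and uses the hypothetical name `vc_demo`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.3, random_state=0)

vc_demo = VotingClassifier(
    estimators=[
        ('rfc', RandomForestClassifier(random_state=0)),
        ('lgr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
)
vc_demo.fit(X_tr, y_tr)

# Average the class probabilities of the fitted sub-estimators by hand
avg_proba = np.mean([est.predict_proba(X_te) for est in vc_demo.estimators_], axis=0)

# Pick the class with the highest averaged probability
manual_pred = vc_demo.classes_[np.argmax(avg_proba, axis=1)]

# Compare the manual result with the ensemble's own prediction
matches = np.array_equal(manual_pred, vc_demo.predict(X_te))
print(f"Manual soft vote matches VotingClassifier: {matches}")
```

Soft voting needs every base estimator to implement `predict_proba`; with `voting='hard'` the ensemble would take a majority vote over predicted labels.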

