Task 1: Custom Transformation Using FunctionTransformer

In this task, create a log transformer and a ratio transformer using Scikit-Learn’s FunctionTransformer. Apply these transformers to a dataset and observe the output.

● Step 1: Import a dataset (you can use the housing.csv dataset or a built-in dataset such as California housing).

In [179]:
import pandas as pd
df= pd.read_csv('housing.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


● Step 2: Create a log transformer for transforming numerical features with heavy-tailed distributions.

In [180]:
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)  

log_columns = ['median_house_value', 'total_rooms', 'population']
log_transformed_data = log_transformer.transform(df[log_columns])

df[['log_median_house_value', 'log_total_rooms', 'log_population']] = log_transformed_data
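
The inverse_func argument passed in Step 2 lets the same transformer undo the log. A minimal sketch of that round trip, assuming a small hand-made DataFrame (the `toy` frame is hypothetical and only borrows a few values from the table above):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for two heavy-tailed housing columns.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "population": [322.0, 2401.0, 496.0],
})

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

# Forward pass: natural log of every value.
logged = log_transformer.transform(toy)

# Backward pass: np.exp recovers the original values up to float error.
restored = log_transformer.inverse_transform(logged)
round_trip_ok = np.allclose(np.asarray(restored), toy.to_numpy())
print(round_trip_ok)
```

Note that np.log is undefined at zero, so a column that can contain zeros would need a different function (np.log1p paired with np.expm1 is one common choice).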


● Step 3: Create a ratio transformer that computes the ratio of two columns from the dataset.

In [181]:
from sklearn.preprocessing import FunctionTransformer

ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_columns = df[['total_rooms', 'households']].values
rooms_household_ratio = ratio_transformer.transform(ratio_columns)
df['rooms_per_household'] = rooms_household_ratio


● Step 4: Apply these transformers to the dataset and check the results.

In [182]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,log_median_house_value,log_total_rooms,log_population,rooms_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,13.022764,6.779922,5.774552,6.984127
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,12.789684,8.867709,7.783641,6.238137
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,12.771671,7.290975,6.206576,8.288136
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,12.740517,7.149917,6.324359,5.817352
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,12.743151,7.394493,6.336826,6.281853
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND,11.265745,7.417580,6.739337,5.045455
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND,11.252859,6.546785,5.874931,6.114035
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND,11.432799,7.720462,6.914731,5.205543
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,11.346871,7.528332,6.608001,5.329513


Task 2: Building Custom Transformer Class

Create a custom transformer by subclassing BaseEstimator and TransformerMixin from Scikit-Learn. The transformer standardizes the selected features of the dataset by removing their mean and scaling by their standard deviation.

● Step 1: Create a class StandardScalerClone that implements fit, transform, and fit_transform methods.

● Step 2: Add input validation to the fit method using check_array from sklearn.utils.validation.

In [183]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
import numpy as np

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True): 
        self.with_mean = with_mean

    def fit(self, X, y=None):  
        X = check_array(X)  
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  
        return self  

    def transform(self, X):
        check_is_fitted(self)  
        X = check_array(X)
        # Raise instead of assert: asserts are skipped when Python runs with -O.
        if X.shape[1] != self.n_features_in_:
            raise ValueError("Input has the wrong number of features!")
        
        if self.with_mean:
            X = X - self.mean_  
        return X / self.scale_  

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X) 
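
The arithmetic that fit and transform perform can be checked without the class. The sketch below, which assumes synthetic random data rather than the housing file, applies the same mean and standard deviation formulas with NumPy and compares the result with Scikit-Learn's built-in StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the housing dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 2))

# Same formulas as StandardScalerClone: column mean and population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Reference result from the built-in scaler.
reference = StandardScaler().fit_transform(X)

matches_reference = np.allclose(manual, reference)
print(matches_reference)
```

NumPy's std defaults to ddof=0, which is what StandardScaler uses. pandas' DataFrame.std defaults to ddof=1, so standardizing with pandas directly gives slightly different numbers.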


In [184]:
df1= pd.read_csv('housing.csv')
df1

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [185]:
df=df.dropna()

● Step 3: Apply this custom transformer to any numerical column of the dataset.

In [186]:
scaler = StandardScalerClone()
scale_cols = ['longitude', 'total_rooms', 'latitude', 'housing_median_age', 'total_bedrooms',
              'population', 'households', 'median_income', 'median_house_value']
scaled_features = scaler.fit_transform(df[scale_cols])


In [187]:
scaled_df = pd.DataFrame(scaled_features, columns=scale_cols)
scaled_df

Unnamed: 0,longitude,total_rooms,latitude,housing_median_age,total_bedrooms,population,households,median_income,median_house_value
0,-1.327314,-0.803813,1.051717,0.982163,-0.970325,-0.973320,-0.976833,2.345163,2.128819
1,-1.322323,2.042130,1.042355,-0.606210,1.348276,0.861339,1.670373,2.332632,1.313626
2,-1.332305,-0.535189,1.037674,1.855769,-0.825561,-0.819769,-0.843427,1.782939,1.258183
3,-1.337296,-0.623510,1.037674,1.855769,-0.718768,-0.765056,-0.733562,0.932970,1.164622
4,-1.337296,-0.461970,1.037674,1.855769,-0.611974,-0.758879,-0.628930,-0.013143,1.172418
...,...,...,...,...,...,...,...,...,...
20428,-0.758318,-0.444580,1.800677,-0.288535,-0.388895,-0.511787,-0.443207,-1.216727,-1.115492
20429,-0.818212,-0.887557,1.805358,-0.844466,-0.920488,-0.943315,-1.008223,-0.692044,-1.124155
20430,-0.823203,-0.175042,1.777272,-0.923885,-0.125472,-0.368826,-0.173778,-1.143171,-0.992477
20431,-0.873115,-0.355344,1.777272,-0.844466,-0.305834,-0.603564,-0.393506,-1.055136,-1.058316


Task 3: Clustering-Based Custom Transformer

Create a custom transformer that uses K-Means clustering to group data points and computes the similarity of each point to the cluster centers using the RBF kernel.

● Step 1: Implement a class ClusterSimilarity that:

○ Uses KMeans clustering in the fit method.

○ Computes similarities to the cluster centers using rbf_kernel in the transform method.

In [188]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Set n_init explicitly so KMeans does not warn about its changing default.
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
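
For each point x and center c, rbf_kernel returns exp(-gamma * ||x - c||^2), so a point sitting on a center scores 1 and the score decays toward 0 with distance. A minimal sketch with hand-picked points and centers (both arrays are made up for illustration) compares the library call with the formula written out in NumPy:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative values only; in the transformer the centers come from KMeans.
points = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
gamma = 0.5

# Library result: one row per point, one column per center.
sim = rbf_kernel(points, centers, gamma=gamma)

# Same quantity by hand: squared Euclidean distances, then the exponential.
sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
manual = np.exp(-gamma * sq_dists)

print(np.allclose(sim, manual))
```

A larger gamma makes the similarity fall off faster, so each cluster's feature becomes more local.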


● Step 2: Apply this custom transformer to the latitude and longitude columns of the dataset.

In [189]:
import pandas as pd

# Drop rows with missing values so the row count matches the earlier tasks.
df2 = pd.read_csv('housing.csv').dropna()

cluster_similarity = ClusterSimilarity(n_clusters=3, gamma=0.5, random_state=42)

cluster_similarity.fit(df2[['latitude', 'longitude']])

similarity_scores = cluster_similarity.transform(df2[['latitude', 'longitude']])

similarity_df = pd.DataFrame(similarity_scores, columns=cluster_similarity.get_feature_names_out())

similarity_df



Unnamed: 0,Cluster 0 similarity,Cluster 1 similarity,Cluster 2 similarity
0,2.127790e-08,0.929988,0.007069
1,2.409456e-08,0.928199,0.007526
2,2.299744e-08,0.920248,0.007291
3,2.201619e-08,0.917348,0.007108
4,2.201619e-08,0.917348,0.007108
...,...,...,...
20428,6.654463e-10,0.276606,0.000885
20429,4.249601e-10,0.299587,0.000719
20430,5.764152e-10,0.327113,0.000872
20431,4.108719e-10,0.349409,0.000746


Task 4: Pipelines and ColumnTransformers

Combine the transformers in a pipeline and apply them to both numerical and categorical features of the dataset.

● Step 1: Create a numerical pipeline that:

○ Handles missing values using SimpleImputer (median strategy).

○ Applies standardization using the custom StandardScalerClone class.

In [190]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
df3 = pd.read_csv('housing.csv')


num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]

# StandardScalerClone is the custom transformer defined in Task 2.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScalerClone()),
])

● Step 2: Create a categorical pipeline that:

○ Imputes missing categorical values using SimpleImputer (most frequent strategy).

○ Encodes the categories using OneHotEncoder.

In [191]:
from sklearn.preprocessing import OneHotEncoder

cat_attribs = ["ocean_proximity"]
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encoding
])

● Step 3: Create a ColumnTransformer that:

○ Applies the numerical pipeline to numerical columns.

○ Applies the categorical pipeline to categorical columns.

In [192]:
from sklearn.compose import ColumnTransformer


preprocessing = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", cat_pipeline, cat_attribs),
])


● Step 4: Apply the full pipeline to the dataset and output the transformed data.

In [194]:

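transformed_data = preprocessing.fit_transform(df3)

# Numeric names first, then the one-hot names, matching the ColumnTransformer order.
columns = (num_attribs + list(preprocessing.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(cat_attribs)))

transformed_df = pd.DataFrame(transformed_data, columns=columns)
transformed_df


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN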
0,-1.327835,1.052548,0.982143,-0.804819,-0.972476,-0.974429,-0.977033,2.344766,0.0,0.0,0.0,1.0,0.0
1,-1.322844,1.043185,-0.607019,2.045890,1.357143,0.861439,1.669961,2.332238,0.0,0.0,0.0,1.0,0.0
2,-1.332827,1.038503,1.856182,-0.535746,-0.827024,-0.820777,-0.843637,1.782699,0.0,0.0,0.0,1.0,0.0
3,-1.337818,1.038503,1.856182,-0.624215,-0.719723,-0.766028,-0.733781,0.932968,0.0,0.0,0.0,1.0,0.0
4,-1.337818,1.038503,1.856182,-0.462404,-0.612423,-0.759847,-0.629157,-0.012881,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-0.758826,1.801647,-0.289187,-0.444985,-0.388283,-0.512592,-0.443449,-1.216128,0.0,1.0,0.0,0.0,0.0
20636,-0.818722,1.806329,-0.845393,-0.888704,-0.922403,-0.944405,-1.008420,-0.691593,0.0,1.0,0.0,0.0,0.0
20637,-0.823713,1.778237,-0.924851,-0.174995,-0.123608,-0.369537,-0.174042,-1.142593,0.0,1.0,0.0,0.0,0.0
20638,-0.873626,1.778237,-0.845393,-0.355600,-0.304827,-0.604429,-0.393753,-1.054583,0.0,1.0,0.0,0.0,0.0
