### Data Preprocessing

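Scaling: Put features on comparable scales so that no single feature dominates distance-based or gradient-based models.

* StandardScaler: Rescales each feature (column) to mean 0 and standard deviation 1.

* MinMaxScaler: Rescales each feature (column) to the [0, 1] range.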

In [1]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Standardized Data:\n", scaled_data)

minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(data)
print("Min-Max Scaled Data:\n", minmax_scaled_data)


Standardized Data:
 [[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Min-Max Scaled Data:
 [[0.  0.  0. ]
 [0.5 0.5 0.5]
 [1.  1.  1. ]]
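To make the two formulas concrete, the following sketch reproduces both scalers by hand with NumPy. It assumes the same 3x3 `data` array as above; the names `manual_standard` and `manual_minmax` are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# StandardScaler: subtract the column mean, divide by the column
# standard deviation (population std, i.e. ddof=0).
manual_standard = (data - data.mean(axis=0)) / data.std(axis=0)

# MinMaxScaler: map each column's minimum to 0 and its maximum to 1.
col_min, col_max = data.min(axis=0), data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

print(np.allclose(manual_standard, StandardScaler().fit_transform(data)))
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(data)))
```

Both comparisons should print `True`, which shows that the scalers work column by column.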


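Normalization: Rescale each row (sample) to unit norm, L2 by default (useful for text/vector data). Unlike the scalers above, Normalizer works row by row rather than column by column.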

In [2]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data)
print("Normalized Data:\n", normalized_data)


Normalized Data:
 [[0.26726124 0.53452248 0.80178373]
 [0.45584231 0.56980288 0.68376346]
 [0.50257071 0.57436653 0.64616234]]
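As a rough check of what row normalization does, the sketch below divides each row by its own L2 norm and compares the result with `Normalizer`. The name `manual_normalized` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# L2 norm of every row, kept as a column so it broadcasts across the row.
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
manual_normalized = data / row_norms

# Each row of the result should now have length 1.
print(np.linalg.norm(manual_normalized, axis=1))
print(np.allclose(manual_normalized, Normalizer().fit_transform(data)))
```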



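Encoding Categorical Variables: Convert categorical variables to numerical formats.

* OneHotEncoder: For nominal (unordered) features.

* OrdinalEncoder: For ordinal (ordered) features.

* LabelEncoder: For the target labels `y` only, not for input features.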

In [3]:
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['Red'], ['Blue'], ['Green'], ['Red']])
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(categories)
print("One-Hot Encoded Data:\n", encoded)


One-Hot Encoded Data:
 [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
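For features whose categories have a natural order, `OrdinalEncoder` is the usual choice. A minimal sketch, assuming a hypothetical `sizes` column with levels Low < Medium < High (this data is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the order is passed explicitly because the
# default would sort the categories alphabetically (High, Low, Medium).
sizes = np.array([['Medium'], ['Low'], ['High'], ['Low']])
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())  # [1. 0. 2. 0.]
```

Passing `categories` explicitly is what keeps the integer codes consistent with the intended ranking.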


Polynomial Features: Add polynomial terms to a dataset for nonlinear modeling.

In [4]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform([[1, 2], [3, 4]])
print("Polynomial Features:\n", poly_features)


Polynomial Features:
 [[ 1.  1.  2.  1.  2.  4.]
 [ 1.  3.  4.  9. 12. 16.]]


PCA for Dimensionality Reduction: Project the data onto a smaller number of orthogonal directions that retain as much of its variance as possible.



In [5]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
print("PCA Reduced Data:\n", reduced_data)


PCA Reduced Data:
 [[ 5.19615242  0.        ]
 [-0.          0.        ]
 [-5.19615242  0.        ]]
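The second column of the output is all zeros because the three rows of `data` lie on a single straight line, so one component captures everything. One way to confirm this is to inspect `explained_variance_ratio_`, as in the sketch below (the signs of the projected values can differ between runs or versions; the variance ratios do not).

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance captured by each component.
variance_ratio = pca.explained_variance_ratio_
print(np.round(variance_ratio, 6))
```

Here the first component should account for essentially all of the variance and the second for none.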


#### Apply Transformations on a Dataset


Objective: Prepare a dataset for regression.


Steps:


1. Scale the features.

2. Expand them with polynomial features.

3. Split into training and test sets.

The diabetes features are all numeric, so no categorical encoding is needed here.


In [6]:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
print("Processed Features Shape:", X_train.shape)


Processed Features Shape: (353, 66)
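The cell above fits the scaler on the full dataset before splitting, so statistics from the test rows influence the training features. A possible leak-free variant, sketched under the same dataset and split settings, splits first and fits the transformers on the training rows only:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X, y = load_diabetes(return_X_y=True)

# Split first so the test rows never influence the fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)  # mean/std from training rows only
poly = PolynomialFeatures(degree=2)

X_train_poly = poly.fit_transform(scaler.transform(X_train))
X_test_poly = poly.transform(scaler.transform(X_test))  # reuse fitted steps

print(X_train_poly.shape, X_test_poly.shape)  # (353, 66) (89, 66)
```

The shapes are unchanged; only the data used to fit the scaler differs.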


#### Automate Preprocessing Steps Using Pipelines

Pipelines: Chain preprocessing and modeling steps into a single object.

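In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# The pipeline scales and expands features itself, so give it the raw
# features rather than the already-expanded X_train from In [6].
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', LinearRegression())
])

pipeline.fit(X_train, y_train)
print("Pipeline Score:", pipeline.score(X_test, y_test))  # R^2 on the test set


Fitting this pipeline on the X_train from In [6] instead applies scaling and polynomial expansion a second time (66 columns become 2,278 for only 353 training rows), and the overfit model then scores -26.818578169675583 on the test set.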
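One practical benefit of a pipeline is that it can be cross-validated as a single estimator, with the preprocessing refit inside every fold. The sketch below assumes the raw diabetes features and 5 folds; the fold count is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X, y = load_diabetes(return_X_y=True)  # raw, untransformed features

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', LinearRegression()),
])

# Each fold refits the scaler, the expansion, and the model on its own
# training portion, so no statistics leak from the held-out portion.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean R^2 across folds:", scores.mean())
```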


#### Handle Imbalanced Datasets

Oversampling and Undersampling:

* Oversample minority classes (e.g., SMOTE).

* Undersample majority classes.

In [14]:
import numpy as np
from sklearn.datasets import make_classification
from collections import Counter

# Generate an imbalanced dataset
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000, random_state=42)
print("Original Class Distribution:", dict(Counter(y)))

# Identify minority and majority class
minority_class = 1
X_minority = X[y == minority_class]

# Oversample the minority class
X_oversampled = np.tile(X_minority, (9, 1))  # Duplicate minority samples 9 times
y_oversampled = np.ones(len(X_oversampled))

# Append the duplicates to the full original dataset (both classes)
X_resampled = np.vstack([X, X_oversampled])
y_resampled = np.hstack([y, y_oversampled])

print("Resampled Class Distribution:", dict(Counter(y_resampled)))


Original Class Distribution: {0: 897, 1: 103}
Resampled Class Distribution: {0.0: 897, 1.0: 1030}
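Undersampling is listed above but not demonstrated. A minimal sketch of random undersampling with plain NumPy follows, assuming the same synthetic dataset; the seed and the variable names (`kept_majority_idx`, `X_under`, `y_under`) are illustrative choices.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep a random subset of the majority class, the same size as the minority.
kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx),
                               replace=False)

keep_idx = np.concatenate([minority_idx, kept_majority_idx])
X_under, y_under = X[keep_idx], y[keep_idx]

print("Undersampled Class Distribution:", dict(Counter(y_under.tolist())))
```

Undersampling discards data, so it tends to suit cases where the majority class is large enough to spare the rows. In practice, resampling is applied to the training split only, so that the test set keeps the original class balance.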


In [16]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate an imbalanced dataset
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale the features
    ('classifier', DecisionTreeClassifier())  # Apply a classifier
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline
accuracy = pipeline.score(X_test, y_test)
print("Pipeline Accuracy:", accuracy)


Pipeline Accuracy: 0.9333333333333333
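With roughly 90% of samples in one class, always predicting the majority class already gives an accuracy of about 0.9, so plain accuracy says little here. The sketch below shows one way to get a more informative picture: a stratified split, `class_weight='balanced'`, and balanced accuracy plus a confusion matrix. These settings are assumptions for illustration, not the configuration used in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

# stratify=y keeps the 90/10 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# class_weight='balanced' reweights samples inversely to class frequency.
clf = DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Balanced accuracy averages the recall of each class.
balanced_acc = balanced_accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted

print("Balanced Accuracy:", balanced_acc)
print("Confusion Matrix:\n", cm)
```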
