Handling missing values

Handling missing values keeps the dataset reliable and usable for modeling. The two common techniques are imputation, where missing values are replaced with the mean, median, or mode of the respective feature, and dropping, where rows or columns with too many missing values are removed.

In [6]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

print("Missing values before introducing NaNs:", df.isnull().sum().sum())

# Simulate missing values (for practice):
# set the first five rows of column 1, "sepal width (cm)", to NaN
df.iloc[0:5, 1] = np.nan

# Fill missing values with the column mean
imputer = SimpleImputer(strategy="mean")
df.iloc[:, :] = imputer.fit_transform(df)

if df.isnull().sum().sum() == 0:
    print("✅ All missing values have been handled. No missing values remain.")
else:
    print("⚠️ There are still missing values in the dataset.")

print(df.head())  # Check if missing values are handled



Missing values before introducing NaNs: 0
✅ All missing values have been handled. No missing values remain.
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1          3.049655                1.4               0.2
1                4.9          3.049655                1.4               0.2
2                4.7          3.049655                1.3               0.2
3                4.6          3.049655                1.5               0.2
4                5.0          3.049655                1.4               0.2
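
The cell above only demonstrates mean imputation. As a minimal sketch of the other techniques mentioned (median imputation and dropping), the following uses a small hypothetical DataFrame `toy`; the column names and the threshold of 3 non-missing values are arbitrary choices for illustration, not part of the Iris example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing, column "a" has one gap
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 100.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 5.0],
    "c": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Dropping: keep only columns with at least 3 non-missing values (arbitrary threshold)
reduced = toy.dropna(axis=1, thresh=3)

# Median imputation: fill the remaining gaps with each column's median,
# which is less affected by outliers (such as 100.0 in "a") than the mean
filled = reduced.fillna(reduced.median())

# Alternative: drop every row that still contains a missing value
complete_rows = reduced.dropna(axis=0)

print(filled)
print(complete_rows.shape)
```

Which option fits depends on how much data is missing: dropping is simple but discards information, while imputation keeps every row at the cost of introducing estimated values.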


Feature scaling (standardization & normalization)

Many machine learning models perform better when features are on a similar scale. Standardization (zero mean, unit variance) and normalization (scaling to [0, 1]) prevent features with large numeric ranges from dominating the others. This matters most for models that rely on distances or dot products between samples, such as KNN and SVM.

In [13]:
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = data.data  # Extract features
y = data.target  # Target variable (species)

# Convert to DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Standardize features

print(X_scaled[:5])  # Display first 5 rows after scaling


[[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
 [-1.38535265  0.32841405 -1.39706395 -1.3154443 ]
 [-1.50652052  0.09821729 -1.2833891  -1.3154443 ]
 [-1.02184904  1.24920112 -1.34022653 -1.3154443 ]]
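
The cell above covers standardization only. Normalization to the [0, 1] range can be sketched with scikit-learn's `MinMaxScaler`; the variable names `X_iris` and `X_normalized` are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_iris = load_iris().data

# Rescale each feature independently so its minimum maps to 0 and its maximum to 1
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_iris)

print(X_normalized.min(axis=0))  # per-feature minimum after scaling
print(X_normalized.max(axis=0))  # per-feature maximum after scaling
```

Unlike standardization, min-max scaling is bounded, but a single extreme value can compress the rest of the feature into a narrow part of the range.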


Feature selection

Not all features contribute equally to predictions, and irrelevant ones can lead to overfitting. VarianceThreshold removes low-variance features, while SelectKBest keeps the k features that score highest on a statistical test (here the ANOVA F-test, f_classif). Training on fewer, more relevant features can improve generalization and reduces computational cost.

In [14]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=2)  # Select top 2 features
X_selected = selector.fit_transform(X, y)

print(X_selected[:5])  # Display selected features


[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]]
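
The cell above shows SelectKBest. A comparable sketch for `VarianceThreshold` follows, using a small hypothetical matrix `X_toy` whose first column is constant; the data is made up purely to show the effect.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: the first column is constant, so it carries no information
X_toy = np.array([
    [1.0, 0.1, 5.0],
    [1.0, 0.2, 3.0],
    [1.0, 0.1, 8.0],
    [1.0, 0.3, 1.0],
])

# threshold=0.0 (the default) keeps only features with non-zero variance
vt = VarianceThreshold(threshold=0.0)
X_reduced = vt.fit_transform(X_toy)

print(vt.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

Note that VarianceThreshold ignores the target `y` entirely, so it can only remove features that barely vary, not features that vary but are unrelated to the label.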


Creating polynomial features

Linear models may not capture complex patterns in data. Polynomial features add squared, cubic, or higher-order terms, along with products of different features, so that a linear model can fit curved relationships. While this can improve accuracy, the number of features grows quickly with the degree and may cause overfitting, so the degree should be chosen carefully.



In [15]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly.shape)  # New feature set size


(150, 14)
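
To see how quickly the feature count grows, the sketch below counts the polynomial terms for a given number of inputs and degree, assuming all powers and interaction terms are kept and the bias column is excluded (as with `include_bias=False` above). The helper `n_polynomial_features` is a hypothetical name introduced here, not a scikit-learn function.

```python
from math import comb


def n_polynomial_features(n_inputs: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_inputs variables (no bias term)."""
    return comb(n_inputs + degree, degree) - 1


# Feature count for the 4 Iris features at increasing degrees
for degree in range(1, 6):
    print(degree, n_polynomial_features(4, degree))
```

For 4 inputs and degree 2 this gives 14, matching the shape printed above; at degree 3 the count is already 34.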
