
## K-Nearest Neighbors (KNN) Imputer

K-Nearest Neighbors (KNN) Imputer is a powerful and intuitive method used to fill in missing values (impute) in a dataset. Unlike simpler imputation techniques that rely on statistical measures like mean, median, or mode for an entire feature, KNN Imputer leverages the relationships between features to make more informed estimations.

### How it Works:

The core idea behind KNN Imputer is to find the 'k' nearest neighbors to an observation with a missing value and then use the values from these neighbors to estimate the missing data point. Here's a step-by-step breakdown:

1.  **Define 'k'**: First, you choose the number of neighbors (`k`) to consider. This is a crucial hyperparameter.

2.  **Distance Metric**: For an observation with a missing value, the algorithm calculates the distance between this observation and every other observation that has an observed value for the feature in question. Common distance metrics include Euclidean and Manhattan distance. scikit-learn's `KNNImputer` defaults to `nan_euclidean`, a Euclidean distance computed only over the coordinates present in both rows and rescaled to compensate for the skipped ones.

3.  **Identify Neighbors**: Based on the chosen distance metric, the `k` observations with the smallest distances (i.e., the 'nearest' neighbors) are identified.

4.  **Imputation**: Once the `k` neighbors are found:
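    *   **For numerical features**: The missing value is typically imputed with the mean (or sometimes median) of the corresponding feature values from its `k` nearest neighbors. scikit-learn's `KNNImputer` uses the mean, either uniform or distance-weighted (`weights='uniform'` or `weights='distance'`).
    *   **For categorical features**: The missing value is usually imputed with the mode (most frequent value) of the corresponding feature values from its `k` nearest neighbors. Note that scikit-learn's `KNNImputer` works on numeric arrays and always averages, so categorical columns must be encoded first and the result interpreted with care.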
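
The four steps above can be traced by hand. The following is a minimal sketch, assuming scikit-learn's default `nan_euclidean` metric and uniform weights; the variable names (`row`, `col`, `donors`) are illustrative and not part of any API. It reproduces the value that `KNNImputer` returns for a single missing cell.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Columns: Age, Income, Score (row 2 is missing Age)
X = np.array([
    [25.0, 40000.0, 80.0],
    [30.0, 50000.0, 90.0],
    [np.nan, 42000.0, 82.0],
    [35.0, 60000.0, 95.0],
])

k = 2             # Step 1: choose k
row, col = 2, 0   # position of the missing value

# Step 2: distance from the incomplete row to every row, ignoring NaN coordinates
dist = nan_euclidean_distances(X[[row]], X)[0]

# Only rows that actually observed the target feature can donate a value
donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]

# Step 3: keep the k closest donors
nearest = sorted(donors, key=lambda i: dist[i])[:k]

# Step 4: impute with the mean of the donors' values
manual_value = X[nearest, col].mean()

# Compare with scikit-learn's result for the same cell
sklearn_value = KNNImputer(n_neighbors=k).fit_transform(X)[row, col]
print(nearest, manual_value, sklearn_value)
```

Because `Income` is on a much larger scale than `Score`, the neighbor choice here is driven almost entirely by `Income`.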

### Advantages:

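*   **Preserves data relationships**: Neighbors are chosen by similarity across the other features, so imputed values reflect the relationships between features. This often gives more accurate estimates than a single column-wide statistic.
*   **Handles various data types**: The general technique can be applied to both numerical and categorical data (with appropriate distance metrics and imputation strategies). scikit-learn's `KNNImputer` itself requires numeric input.
*   **Less distortion**: Filling every gap with one constant (such as the column mean) shrinks the feature's variance and weakens its correlation with other features. Neighbor-based values vary from row to row, which reduces this effect.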
*   **Produces plausible values**: The imputed values are derived from actual data points, making them more realistic.

### Disadvantages:

*   **Computational cost**: For large datasets, finding the `k` nearest neighbors for every missing value can be computationally expensive and time-consuming.
*   **Sensitivity to 'k'**: The choice of `k` can significantly impact the imputation results. An optimal `k` often requires experimentation.
*   **Impact of irrelevant features**: If the dataset contains many irrelevant features, they can distort the distance calculations and lead to poor neighbor selection.
*   **Sensitivity to feature scale**: Distance calculations are dominated by features with large numeric ranges, so data scaling (e.g., `StandardScaler`) is often a necessary preprocessing step.
*   **Extreme values of `k`**: If `k` is too small, the imputation can be overly sensitive to noise in the data. If `k` is too large, it might average out important local patterns.
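
One way to explore the effect of `k` is to hide values that are actually known, impute them, and measure the error. The sketch below assumes a synthetic dataset (generated here purely for illustration) in which the second column depends on the first; the column roles, the 20% masking rate, and the candidate `k` values are all arbitrary choices.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
n = 200

# Synthetic data: x2 is correlated with x1, x3 is pure noise
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
X_full = np.column_stack([x1, x2, x3])

# Hide roughly 20% of the known x2 values
mask = rng.random(n) < 0.2
X_missing = X_full.copy()
X_missing[mask, 1] = np.nan

# Mean absolute error of the imputed cells for several values of k
errors = {}
for k in (1, 3, 5, 10, 25):
    X_hat = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    errors[k] = float(np.mean(np.abs(X_hat[mask, 1] - X_full[mask, 1])))

for k, err in errors.items():
    print(f"k={k:>2}  MAE={err:.3f}")
```

The same masking idea works on real data, as long as the hidden cells are chosen from values that were originally observed.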

### When to Use:

KNN Imputer is generally a good choice when:

*   You have a relatively small to medium-sized dataset.
*   The missingness is not extensive (e.g., no feature is more than 50% missing).
*   You suspect that the missing values are related to other features in the dataset.
*   You need a more sophisticated imputation method than simple mean/median imputation but want to avoid model-based imputation for simplicity.

In [2]:
from sklearn.preprocessing import StandardScaler


In [3]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Sample data
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35],
    'Income': [40000, 50000, 42000, 60000],
    'Score': [80, 90, 82, 95]
})

# Apply KNN Imputer
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)


    Age   Income  Score
0  25.0  40000.0   80.0
1  30.0  50000.0   90.0
2  27.5  42000.0   82.0
3  35.0  60000.0   95.0


In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=3))
])

df_imputed = pipeline.fit_transform(df)


In [6]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (3, 3)
X_test shape: (1, 3)


In [7]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train)
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

print("X_train_imputed:\n", X_train_imputed)
print("X_test_imputed:\n", X_test_imputed)

X_train_imputed:
 [[3.5e+01 6.0e+04 9.5e+01]
 [2.5e+01 4.0e+04 8.0e+01]
 [3.0e+01 4.2e+04 8.2e+01]]
X_test_imputed:
 [[3.e+01 5.e+04 9.e+01]]


# Basic KNN Imputer (Single Missing Column)

In [9]:
import numpy as np
from sklearn.impute import KNNImputer
import pandas as pd

In [10]:
# Dataset
df = pd.DataFrame({
    'Age': [22, 25, np.nan, 30, 28],
    'Salary': [30000, 35000, 32000, 40000, 38000]
})


In [11]:
print("Before Imputation:\n", df)

# KNN Imputer
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nAfter Imputation:\n", df_imputed)

Before Imputation:
     Age  Salary
0  22.0   30000
1  25.0   35000
2   NaN   32000
3  30.0   40000
4  28.0   38000

After Imputation:
     Age   Salary
0  22.0  30000.0
1  25.0  35000.0
2  23.5  32000.0
3  30.0  40000.0
4  28.0  38000.0
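
The example above averages the two neighbors equally. As a small variation, `KNNImputer` also accepts `weights='distance'`, which weights each neighbor by the inverse of its distance so that closer rows count more. The sketch below reuses the same data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [22, 25, np.nan, 30, 28],
    'Salary': [30000, 35000, 32000, 40000, 38000]
})

# Uniform weights: plain mean of the two nearest ages (22 and 25)
uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(df)

# Distance weights: the row with Salary 30000 is closer to 32000 than the
# row with Salary 35000, so its age (22) receives more weight
weighted = KNNImputer(n_neighbors=2, weights='distance').fit_transform(df)

uniform_age = uniform[2, 0]
weighted_age = weighted[2, 0]
print(uniform_age, round(weighted_age, 2))
```

Here the salary gaps are 2000 and 3000, so the inverse-distance weights are in a 3:2 ratio and the weighted estimate moves from 23.5 toward 22.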


# Multiple Columns Missing Values

In [12]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [22, np.nan, 28, np.nan, 30],
    'Income': [30000, 35000, np.nan, 42000, 40000],
    'Score': [80, 85, 82, np.nan, 90]
})

print("Before Imputation:\n", df)

imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nAfter Imputation:\n", df_imputed)


Before Imputation:
     Age   Income  Score
0  22.0  30000.0   80.0
1   NaN  35000.0   85.0
2  28.0      NaN   82.0
3   NaN  42000.0    NaN
4  30.0  40000.0   90.0

After Imputation:
          Age   Income  Score
0  22.000000  30000.0   80.0
1  26.666667  35000.0   85.0
2  28.000000  35000.0   82.0
3  26.000000  42000.0   85.0
4  30.000000  40000.0   90.0


# KNN Imputer with Feature Scaling

In [13]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'Age': [20, 25, np.nan, 35],
    'Income': [20000, 50000, 300000, 70000],
    'Score': [70, 85, 80, np.nan]
})

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=2))
])

df_imputed = pipeline.fit_transform(df)

print("After Scaling + KNN Imputation:\n", df_imputed)


After Scaling + KNN Imputation:
 [[-1.06904497 -0.80985829 -1.33630621]
 [-0.26726124 -0.53990552  1.06904497]
 [ 0.53452248  1.70970083  0.26726124]
 [ 1.33630621 -0.35993702 -0.13363062]]
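
The pipeline output above is in standardized units, not in the original units of `Age`, `Income`, and `Score`. If the imputed values are needed on the original scale, one option is to keep the scaler as a separate object and call its `inverse_transform` after imputing. This is a minimal sketch of that idea on the same data; it is not the only way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [20, 25, np.nan, 35],
    'Income': [20000, 50000, 300000, 70000],
    'Score': [70, 85, 80, np.nan]
})

scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=2)

# StandardScaler ignores NaNs when fitting and leaves them in place
scaled = scaler.fit_transform(df)

# Neighbors are found in the scaled space, so Income no longer dominates
imputed_scaled = imputer.fit_transform(scaled)

# Map everything back to the original units
restored = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                        columns=df.columns)
print(restored.round(2))
```

The observed values come back unchanged, and the two filled cells are now readable as an age and a score.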


# Train–Test Split (Avoid Data Leakage 🚨)

In [14]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer

# Data
df = pd.DataFrame({
    'Age': [22, np.nan, 25, 30, np.nan, 28],
    'Income': [30000, 32000, 35000, 40000, 38000, 36000],
    'Purchased': [0, 1, 0, 1, 1, 0]
})

X = df.drop('Purchased', axis=1)
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

imputer = KNNImputer(n_neighbors=2)

# Correct way
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

print("Train Imputed:\n", X_train_imputed)
print("\nTest Imputed:\n", X_test_imputed)


Train Imputed:
 [[2.8e+01 3.6e+04]
 [2.5e+01 3.5e+04]
 [2.9e+01 3.8e+04]
 [3.0e+01 4.0e+04]]

Test Imputed:
 [[2.20e+01 3.00e+04]
 [2.65e+01 3.20e+04]]


# Compare Mean Imputer vs KNN Imputer

In [15]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    'Age': [20, 22, np.nan, 28, 30],
    'Income': [20000, 22000, 24000, 30000, 32000]
})

# Mean Imputer
mean_imputer = SimpleImputer(strategy='mean')
mean_result = mean_imputer.fit_transform(df)

# KNN Imputer
knn_imputer = KNNImputer(n_neighbors=2)
knn_result = knn_imputer.fit_transform(df)

print("Mean Imputer Result:\n", mean_result)
print("\nKNN Imputer Result:\n", knn_result)


Mean Imputer Result:
 [[2.0e+01 2.0e+04]
 [2.2e+01 2.2e+04]
 [2.5e+01 2.4e+04]
 [2.8e+01 3.0e+04]
 [3.0e+01 3.2e+04]]

KNN Imputer Result:
 [[2.0e+01 2.0e+04]
 [2.2e+01 2.2e+04]
 [2.1e+01 2.4e+04]
 [2.8e+01 3.0e+04]
 [3.0e+01 3.2e+04]]


# Label Encoding + KNN Imputer

In [16]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer

# Dataset
df = pd.DataFrame({
    'Age': [22, np.nan, 28, 30],
    'Income': [30000, 35000, np.nan, 40000],
    'Gender': ['Male', 'Female', 'Female', 'Male']
})

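# Encode categorical column as integer codes
# (LabelEncoder is intended for target labels; OrdinalEncoder is the feature-oriented equivalent)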
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# KNN Imputer
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)


    Age   Income  Gender
0  22.0  30000.0     1.0
1  25.0  35000.0     0.0
2  28.0  37500.0     0.0
3  30.0  40000.0     1.0
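
In the example above `Gender` has no missing values, so the codes pass through untouched. When the encoded column itself has a gap, `KNNImputer` averages the neighbors' codes, which can produce a value that is not a valid category. The sketch below illustrates this with a hypothetical four-row dataset and a manual `map` in place of an encoder, so that the missing entry stays `NaN`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [22, 25, 28, 30],
    'Gender': ['Male', 'Female', np.nan, 'Male'],
})

# Manual integer coding; the missing entry remains NaN
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

imputed = KNNImputer(n_neighbors=2).fit_transform(df)

# The two nearest rows by Age (30 and 25) are one Male and one Female,
# so the averaged code falls halfway between the two categories
gender_code = imputed[2, 1]
print(gender_code)
```

A value such as 0.5 has to be post-processed (for example by a majority vote among the neighbors) before it can be decoded, which is one reason label-coded nominal features need care with this imputer.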


# OneHotEncoding + KNN Imputer (Correct for Nominal Data ⭐)

In [17]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, np.nan, 30, 28],
    'Salary': [40000, 45000, np.nan, 48000],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']
})

# One-hot encode
df_encoded = pd.get_dummies(df, columns=['City'])

# KNN Imputer
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df_encoded),
                           columns=df_encoded.columns)

print(df_imputed)


    Age   Salary  City_Chennai  City_Delhi  City_Mumbai
0  25.0  40000.0           0.0         1.0          0.0
1  29.0  45000.0           0.0         0.0          1.0
2  30.0  46500.0           0.0         1.0          0.0
3  28.0  48000.0           1.0         0.0          0.0


In [18]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer

# Dataset
df = pd.DataFrame({
    'Age': [22, np.nan, 28, 35],
    'Income': [30000, 40000, np.nan, 50000],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai'],
    'Purchased': [0, 1, 0, 1]
})

X = df.drop('Purchased', axis=1)
y = df['Purchased']

num_cols = ['Age', 'Income']
cat_cols = ['City']

# Preprocessing pipelines
num_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=2))
])

cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

X_processed = preprocessor.fit_transform(X)

print("Processed Data:\n", X_processed)


Processed Data:
 [[-1.19216603 -1.22474487  0.          1.          0.        ]
 [ 0.03137279  0.          0.          0.          1.        ]
 [-0.06274558  0.          0.          1.          0.        ]
 [ 1.25491161  1.22474487  1.          0.          0.        ]]


In [19]:
df=pd.read_csv("Titanic.csv")

In [20]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [23]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [24]:
df = df.loc[:, ['Age', 'Pclass', 'Fare', 'Survived']]


In [25]:
df.columns

Index(['Age', 'Pclass', 'Fare', 'Survived'], dtype='object')

In [26]:
df.head()

|   | Age | Pclass | Fare | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |


In [29]:
df.isnull().mean()*100

|   | 0 |
|---|---|
| Age | 19.86532 |
| Pclass | 0.0 |
| Fare | 0.0 |
| Survived | 0.0 |


In [30]:
X=df.drop(columns=['Survived'])
y=df['Survived']

In [31]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [32]:
X_train.head()

|   | Age | Pclass | Fare |
|---|---|---|---|
| 331 | 45.5 | 1 | 28.5 |
| 733 | 23.0 | 2 | 13.0 |
| 382 | 32.0 | 3 | 7.925 |
| 704 | 26.0 | 3 | 7.8542 |
| 813 | 6.0 | 3 | 31.275 |


In [36]:
knn=KNNImputer(n_neighbors=4)

X_train_trf=knn.fit_transform(X_train)
X_test_trf=knn.transform(X_test)

In [37]:
pd.DataFrame(X_train_trf,columns=X_train.columns)

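|   | Age | Pclass | Fare |
|---|---|---|---|
| 0 | 45.5 | 1.0 | 28.5000 |
| 1 | 23.0 | 2.0 | 13.0000 |
| 2 | 32.0 | 3.0 | 7.9250 |
| 3 | 26.0 | 3.0 | 7.8542 |
| 4 | 6.0 | 3.0 | 31.2750 |
| ... | ... | ... | ... |
| 707 | 21.0 | 3.0 | 7.6500 |
| 708 | 41.5 | 1.0 | 31.0000 |
| 709 | 41.0 | 3.0 | 14.1083 |
| 710 | 14.0 | 1.0 | 120.0000 |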


In [38]:
df.isnull().mean()*100

|   | 0 |
|---|---|
| Age | 19.86532 |
| Pclass | 0.0 |
| Fare | 0.0 |
| Survived | 0.0 |


In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr=LogisticRegression()

lr.fit(X_train_trf,y_train)

y_pred=lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7430167597765364

Comparison with `SimpleImputer` using the mean strategy:

In [41]:
si = SimpleImputer()  # default strategy='mean'

X_train_trf2=si.fit_transform(X_train)
X_test_trf2=si.transform(X_test)

In [42]:
lr=LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2=lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.7374301675977654
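
The two accuracies above come from a single train/test split and differ by less than one percentage point, so the gap may not be meaningful on its own. A steadier comparison refits the imputer inside each cross-validation fold. The sketch below shows the pattern on synthetic data (an assumption made so the snippet runs without `Titanic.csv`); the missing rate and model settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a real dataset, with ~20% of the first column removed
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(len(X)) < 0.2, 0] = np.nan

results = {}
for name, imputer in [('mean', SimpleImputer()),
                      ('knn', KNNImputer(n_neighbors=4))]:
    # The imputer is part of the pipeline, so it is fit on training folds only
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    results[name] = cross_val_score(model, X, y, cv=5, scoring='accuracy')

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Looking at the spread across folds alongside the mean helps judge whether one imputer is really ahead.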

### How to Improve Model Accuracy

Improving the accuracy of a machine learning model often involves a combination of techniques applied at different stages of the machine learning pipeline. Here are some common strategies:

1.  **Feature Engineering**: Creating new features from existing ones or transforming existing features can significantly boost model performance. This might include:
    *   **Polynomial Features**: Creating interaction terms or higher-order terms.
    *   **Domain-Specific Features**: Leveraging expert knowledge to create relevant features.
    *   **Binning/Discretization**: Grouping continuous variables into bins.

2.  **Feature Scaling**: Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This is especially important for distance-based algorithms such as KNN and SVMs, and for gradient-trained models such as neural networks.
    *   **Standardization (e.g., `StandardScaler`)**: Transforms data to have a mean of 0 and a standard deviation of 1.
    *   **Normalization (e.g., `MinMaxScaler`)**: Scales data to a fixed range, usually 0 to 1.

3.  **Handling Missing Values (Imputation)**:
    *   **Advanced Imputation**: Beyond simple mean/median imputation, consider more sophisticated methods like `KNNImputer` (as shown), `IterativeImputer` (MICE), or even model-based imputation if missingness is complex.
    *   **Indicator Variables**: Adding a binary column to indicate whether a value was originally missing can sometimes capture valuable information.

4.  **Outlier Treatment**: Outliers can disproportionately influence model training. Techniques include:
    *   **Removal**: If outliers are clearly errors or few in number.
    *   **Transformation**: Using log or square root transformations.
    *   **Capping/Winsorization**: Replacing outliers with a specified percentile value.

5.  **Model Selection**: Different algorithms excel at different types of problems and datasets. Experiment with a variety of models:
    *   **Linear Models**: Logistic Regression, Linear Regression.
    *   **Tree-based Models**: Decision Trees, Random Forests, Gradient Boosting Machines (e.g., XGBoost, LightGBM).
    *   **Support Vector Machines (SVMs)**.
    *   **Neural Networks**.

6.  **Hyperparameter Tuning**: Most models have hyperparameters that are not learned from the data but control the learning process itself. Optimizing these can lead to significant improvements.
    *   **Grid Search (`GridSearchCV`)**: Exhaustively searches through a specified parameter grid.
    *   **Random Search (`RandomizedSearchCV`)**: Randomly samples parameters from a distribution.
    *   **Bayesian Optimization**: More advanced techniques to efficiently find optimal hyperparameters.

7.  **Ensemble Methods**: Combining multiple models can often lead to better performance than any single model.
    *   **Bagging (e.g., Random Forest)**: Trains multiple models independently and averages their predictions.
    *   **Boosting (e.g., AdaBoost, Gradient Boosting)**: Trains models sequentially, with each new model trying to correct the errors of the previous ones.
    *   **Stacking**: Trains a meta-model to combine the predictions of several base models.

8.  **Cross-Validation**: Using robust cross-validation strategies (e.g., k-fold cross-validation) gives a more reliable estimate of model performance and reduces the risk of tuning the model to a single validation split.

9.  **Addressing Overfitting/Underfitting**:
    *   **Overfitting**: The model performs well on training data but poorly on unseen data. Solutions include more data, regularization, simpler models, or early stopping.
    *   **Underfitting**: The model is too simple to capture the underlying patterns in the data. Solutions include more complex models, more features, or reducing regularization.

10. **Collect More Data**: If feasible, increasing the amount of training data can often improve model generalization and accuracy, especially for complex models.
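
Several of these strategies can be combined in one pipeline. The following is a minimal sketch, on synthetic data assumed purely for illustration, that scales the features, imputes with `KNNImputer`, and uses `GridSearchCV` to tune `n_neighbors` and to test whether missing-value indicator columns (`add_indicator=True`) help. The grid values are arbitrary starting points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with ~10% of all cells removed at random
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ('scaler', StandardScaler()),      # scaling before distance-based imputation
    ('imputer', KNNImputer()),
    ('model', LogisticRegression(max_iter=1000)),
])

param_grid = {
    'imputer__n_neighbors': [2, 5, 10],
    'imputer__add_indicator': [False, True],  # append "was missing" flag columns
}

# Every step is refit inside each fold, so no test-fold information leaks in
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Treating the imputer's settings as hyperparameters of the whole pipeline lets the downstream model's score, rather than intuition, decide which imputation setup to keep.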