### The Power of Imputers in ML


Missing data occurs when some values, such as age or salary, are not recorded in a dataset. Data imputation addresses this by filling in the missing entries with estimates, such as the column mean or the most frequent value. This keeps the dataset complete, so machine learning models that cannot handle missing values can still be trained and make more reliable predictions.

### Why Is Data Imputation Important?
Missing data in our dataset can lead to inaccurate predictions, and many models fail outright when they encounter missing values. Simply dropping entries with missing values isn't always the best solution, especially when only a small fraction of a large and otherwise valuable dataset is missing. By imputing the missing data, we can make better use of our dataset and often improve the accuracy of our machine learning model.

In [1]:
import numpy as np
import pandas as pd

### Create a Sample Dataset to Illustrate Data Imputation

In [2]:
# Toy dataset: NaN marks a value that was not recorded
data = {'age': [25, np.nan, 30, np.nan, 35],
        'salary': [50000, 60000, np.nan, 90000, np.nan]}

dataframe = pd.DataFrame(data)
dataframe

    age   salary
0  25.0  50000.0
1   NaN  60000.0
2  30.0      NaN
3   NaN  90000.0
4  35.0      NaN
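
To make the cost of simply dropping incomplete rows concrete, here is a minimal sketch that rebuilds the same toy dataset, counts the missing values per column, and checks how many rows survive `dropna()`. The variable names `missing_per_column` and `complete_rows` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cell above
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

# Count the missing values in each column
missing_per_column = dataframe.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
complete_rows = dataframe.dropna()
print(f"{len(complete_rows)} of {len(dataframe)} rows remain after dropna()")
```

On this dataset only the first row is fully observed, so dropping incomplete rows would discard most of the data. That is the situation imputation is meant to avoid.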


### Data imputation via Mean, Median or Mode

In [3]:
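from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(dataframe)  # returns a NumPy array
# Wrap the array back into a DataFrame with the original column names
imputed_df = pd.DataFrame(imputed_data, columns=dataframe.columns)
print(imputed_df)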

    age        salary
0  25.0  50000.000000
1  30.0  60000.000000
2  30.0  66666.666667
3  30.0  90000.000000
4  35.0  66666.666667
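
The cell above only uses the mean. As a small sketch of the other two strategies named in the heading, the same `SimpleImputer` can be run with `strategy='median'` and `strategy='most_frequent'`. The `results` dictionary is just a convenience introduced here to keep both outputs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as in the cells above
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

results = {}
for strategy in ('median', 'most_frequent'):
    # Fill each NaN with the column's median or most frequent value
    imputer = SimpleImputer(strategy=strategy)
    imputed = imputer.fit_transform(dataframe)
    results[strategy] = pd.DataFrame(imputed, columns=dataframe.columns)
    print(f"strategy = {strategy}")
    print(results[strategy])
```

The median is usually less sensitive to outliers than the mean, which can matter for skewed columns such as salary. The mode is mainly useful for categorical or discrete columns; in this toy dataset every observed value appears only once, so the mode is not very informative here.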


### Data Imputation Using KNNImputer

In [4]:
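from sklearn.impute import KNNImputer

# Fill each NaN with the average of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(dataframe)  # returns a NumPy array
# Wrap the array back into a DataFrame with the original column names
imputed_df = pd.DataFrame(imputed_data, columns=dataframe.columns)
print(imputed_df)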

    age   salary
0  25.0  50000.0
1  27.5  60000.0
2  30.0  55000.0
3  27.5  90000.0
4  35.0  55000.0
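
To see where the "nearest" rows come from, the sketch below computes the pairwise distances with `nan_euclidean_distances` from `sklearn.metrics.pairwise`, the metric that `KNNImputer` uses by default. It only illustrates the distance step, not the full imputation, and the variable name `distances` is introduced just for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same toy data as in the cells above
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

# Euclidean distance over the features observed in both rows,
# rescaled to account for the features that had to be skipped
distances = nan_euclidean_distances(dataframe)
print(np.round(distances, 2))

# Rows 0 and 1 share only 'salary'; rows 0 and 2 share only 'age'
print(distances[0, 1], distances[0, 2])

# Rows 1 and 2 share no observed feature, so their distance is NaN
print(distances[1, 2])
```

In this tiny dataset several pairs of rows have no feature in common, so their distance is undefined. How such pairs are treated when picking neighbors may depend on the installed scikit-learn version, so the imputed values for a toy example like this one may not reproduce exactly.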


### Conclusion
Data imputation lets our machine learning models make the most of the available data. Mean imputation is simple and fast, while KNN imputation estimates each missing value from similar rows and can therefore capture relationships between features that a single column average ignores. Understanding both techniques is valuable for any data scientist. Remember, the choice of method depends on our specific dataset and the nature of the missing values.