# 1. Distinguish between numerical and categorical variables

In [59]:
import pandas as pd

from sklearn.impute import SimpleImputer

from sklearn.impute import KNNImputer

In [2]:
# Read csv and import to dataframe
df = pd.read_csv("sample_dataset.csv")

In [5]:
print(df.columns)


print(df.shape)

In [6]:
df.dtypes

In [13]:
# Select Categorical variable columns and Numerical Variable Columns
categorical_variable = df.select_dtypes(include=['object', 'category', 'bool']).columns
numerical_variable = df.select_dtypes(exclude=['object', 'category', 'bool']).columns

In [15]:
print(f"Categorical columns:{categorical_variable}")

print(f"Numerical columns:{numerical_variable}")
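As a quick sanity check of the dtype-based split, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration and are not taken from `sample_dataset.csv`). It uses `include="number"` for the numerical side, which is an alternative to the `exclude=[...]` list, not the notebook's own choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "age": [25, 32, 47],                                      # int64
    "income": [40000.0, np.nan, 52000.0],                     # float64
    "city": pd.Series(["Hanoi", "Hue", None], dtype=object),  # object
    "is_member": [True, False, True],                         # bool
})

# Categorical side: same include list as in the notebook cell.
categorical_cols = toy.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# Numerical side: "number" matches int and float columns, but not bool.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Unlike the exclude-based version, `include="number"` would not pick up datetime columns, should the data contain any.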

# 2. Cleaning the numerical features

In [16]:
X = df.iloc[:,0:3]

In [17]:
X

## 2.1 Replace blanks with the mean value 

- Use `SimpleImputer`, a class from `scikit-learn`.
- Key parameters: `missing_values`, `strategy`, `fill_value`.

### What is `fit` and `transform`?

In data analysis and machine learning, `fit` and `transform` are two essential methods used in data preprocessing. They are primarily associated with operations like data scaling, imputing missing values, and encoding categorical variables.

#### **`fit`: Learning from the Data**
- **Definition:** The `fit` method is used to calculate and store information required for the transformation of data. For example, when you `fit` a scaler, it calculates the mean and standard deviation of your data. If you `fit` an imputer, it finds the values (like the mean, median, or mode) to replace the missing data points.
- **Purpose:** It analyzes the data to "learn" the necessary parameters but does not actually modify or transform the data itself.

#### **`transform`: Applying the Learned Rules**
- **Definition:** The `transform` method uses the information obtained during `fit` to modify the data accordingly. For example, if you have a scaler that has learned the mean and standard deviation, `transform` will use these values to scale your data.
- **Purpose:** It actually changes your data based on the rules or parameters learned during `fit`.

#### **`fit_transform`: Combining Both Steps**
- **Definition:** The `fit_transform` method combines both `fit` and `transform` in one step. It is convenient when you want to both learn the parameters from the data and apply the transformation in a single line of code.
- **Purpose:** It’s typically used when you are working with the training data and want to immediately apply the transformation.

### Why Use `fit` and `transform` Separately?

1. **Training vs. Testing Data:**
   - When you work with machine learning models, you usually split your data into a training set and a test set.
   - You `fit` the transformation (like scaling or imputing) on the training set so the model learns from that data.
   - You then `transform` the test set using the same learned parameters. This ensures that the transformation applied to the test set is consistent with what was done to the training set, avoiding data leakage.

2. **Consistency in Transformation:**
   - Once you `fit` the transformation on one dataset, you can use the same learned parameters to `transform` any other dataset. This is useful when you have multiple datasets or want to apply the same transformation to new, unseen data.

### Key Takeaways:

- **`fit`** learns the necessary parameters from the data (like the mean or standard deviation).
- **`transform`** applies these learned parameters to transform the data.
- **`fit_transform`** combines both `fit` and `transform` for a given dataset.
- Use `fit` and `transform` separately when working with training and test data to maintain consistency and avoid data leakage.
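The train/test point can be sketched with a tiny example. The arrays `X_train` and `X_test` below are hypothetical, chosen only so that the learned mean is easy to verify by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data with one missing value in each split.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")

# fit: learn the mean of the observed training values, (1 + 3 + 5) / 3.
imputer.fit(X_train)
print(imputer.statistics_)

# transform: fill the test set with the *training* mean, not the test mean.
X_test_filled = imputer.transform(X_test)
print(X_test_filled)
```

The missing test value is filled with the statistic stored in `imputer.statistics_`, so the large test value `100.0` has no influence on the imputation.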


In [24]:
cleaner = SimpleImputer(strategy='mean')

In [27]:
cleaner.fit(X)

In [28]:
cleaner.transform(X)

## 2.2 Replace blanks with fixed value

In [33]:
cleaner = SimpleImputer(strategy= 'constant', fill_value = 0)

In [34]:
cleaner.fit_transform(X)

## 2.3 Replace blanks with median value

In [30]:
cleaner = SimpleImputer(strategy='median')

In [32]:
cleaner.fit_transform(X)
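To see how the mean and median strategies differ, here is a minimal sketch on a hypothetical column that contains one outlier. The values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three small values, one outlier, one missing entry.
X_toy = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X_toy)
median_filled = SimpleImputer(strategy="median").fit_transform(X_toy)

# The last row is the imputed one in both results.
print(mean_filled[-1, 0])    # mean of 1, 2, 3, 100
print(median_filled[-1, 0])  # median of 1, 2, 3, 100
```

The mean is pulled toward the outlier, while the median stays close to the bulk of the values, which is why the median is often the safer choice for skewed columns.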

# 3. Cleaning the categorical features

In [48]:
df['area error'].isnull().sum()

In [49]:
X = df['area error']
X

In [50]:
X.value_counts()

## 3.1 Cleaning using the most frequent value

In [51]:
cleaner = SimpleImputer(strategy= "most_frequent")

In [52]:
if isinstance(X, pd.Series):
    print("X is a Series")
elif isinstance(X, pd.DataFrame):
    print("X is a DataFrame")

In [53]:
X = X.to_frame()

In [54]:
cleaner.fit_transform(X)

## 3.2 Cleaning using a new value

In [55]:
cleaner = SimpleImputer(strategy= "constant", fill_value='Undefined')

In [56]:
cleaner.fit_transform(X)
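Both categorical strategies can be sketched on a hypothetical string column. The `color` series below is made up for illustration. It is converted with `to_frame()` because `SimpleImputer` expects 2D input.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry.
color = pd.Series(["red", "blue", np.nan, "red"], dtype=object, name="color")

# SimpleImputer expects a 2D input, so turn the Series into a one-column DataFrame.
X_cat = color.to_frame()

# Strategy 1: fill with the most frequent category ("red" appears twice).
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(X_cat)

# Strategy 2: fill with a new placeholder category.
constant = SimpleImputer(strategy="constant", fill_value="Undefined").fit_transform(X_cat)

print(most_frequent.ravel())
print(constant.ravel())
```

Using a placeholder such as `"Undefined"` keeps the fact that a value was missing visible to later steps, whereas `most_frequent` blends the missing rows into the dominant category.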

# 4. KNN blank filling

In [64]:
X = df.iloc[:,0:3]

In [65]:
X

## 4.1 KNN imputer 

In [66]:
cleaner = KNNImputer(n_neighbors=5)

In [67]:
cleaner.fit_transform(X)

In [69]:
cleaner = KNNImputer(n_neighbors=10)
cleaner.fit_transform(X)

In [68]:
cleaner = KNNImputer(n_neighbors=1)
cleaner.fit_transform(X)

## 4.2 KNN Imputer with weights

In [70]:
cleaner = KNNImputer(n_neighbors=10, weights="distance")

In [72]:
cleaner.fit_transform(X)
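The effect of `n_neighbors` and `weights` is easier to follow on a tiny example. The sketch below uses a hypothetical 4x2 array in which only the last row has a missing value, so distances are computed from the first column alone.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is 10x the first; the last row is incomplete.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [6.0, 60.0],
    [3.0, np.nan],
])

# Uniform weights: plain average of the 2 nearest donors (rows with 2.0 and 1.0).
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_toy)

# Distance weights: the closer donor (2.0) counts more than the farther one (1.0).
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_toy)

# A single neighbor simply copies the value of the closest donor.
nearest = KNNImputer(n_neighbors=1).fit_transform(X_toy)

print(uniform[3, 1], distance[3, 1], nearest[3, 1])
```

With distance weighting the imputed value moves toward the closest donor, so it lies between the uniform average and the single-neighbor result.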

# 5. ColumnTransformer and make_column_selector

In [73]:
from sklearn.compose import ColumnTransformer

## 5.1 Using ColumnTransformer

In [74]:
cleaner = ColumnTransformer([
    ('numerical transformer', SimpleImputer(strategy='mean'),numerical_variable),
    ('categorical transformer', SimpleImputer(strategy='most_frequent'),categorical_variable)
])

cleaner.fit_transform(df)

In [77]:
cleaner = ColumnTransformer([
    ('numerical transformer', SimpleImputer(strategy='mean'),[0,1,2]),
    ('categorical transformer', SimpleImputer(strategy='most_frequent'),categorical_variable)
])

cleaner.fit_transform(df)

In [79]:
cleaner = ColumnTransformer([
    ('numerical transformer', SimpleImputer(strategy='mean'),[0,1,2]),
    ('categorical transformer', SimpleImputer(strategy='most_frequent'),categorical_variable)
] , remainder = 'passthrough') # or remainder = 'drop'

cleaner.fit_transform(df)

## 5.2 make_column_selector

In [88]:
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

In [87]:
cleaner = ColumnTransformer([
    ('numerical transformer', SimpleImputer(strategy='mean'), make_column_selector(dtype_exclude="object")),
    ('categorical transformer', SimpleImputer(strategy='most_frequent'), make_column_selector(dtype_include="object"))
], remainder='drop')  # or remainder='passthrough'

cleaner.fit_transform(df)
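A minimal end-to-end sketch of selector-based routing, on a hypothetical toy frame: non-object columns go to the mean imputer (`dtype_exclude="object"`) and object columns go to the most-frequent imputer (`dtype_include="object"`). The data is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in every column.
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "weight": [10.0, 20.0, np.nan],
    "label": pd.Series(["a", np.nan, "a"], dtype=object),
})

ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), make_column_selector(dtype_exclude="object")),
    ("cat", SimpleImputer(strategy="most_frequent"), make_column_selector(dtype_include="object")),
])

filled = ct.fit_transform(toy)
print(filled)

# No missing values should remain after both imputers have run.
print(pd.isna(filled).any())
```

The two selectors must be complementary. If both used the same dtype filter, the same columns would be imputed twice and the others would be left out.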

# 6. Exercises

### 6.1 Exercise 1
- Load `sample_dataset.csv`.
- Replace the missing values in the categorical variables with "N".
- Replace the missing values in the numerical variables with the mean value.

In [89]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector

In [93]:
df = pd.read_csv("sample_dataset.csv")

In [94]:
cleaner = ColumnTransformer([
    ('numerical', SimpleImputer(strategy= 'mean'), make_column_selector(dtype_exclude="object")),
    ('categorical', SimpleImputer(strategy= 'constant', fill_value= 'N'), make_column_selector(dtype_include="object"))
])

In [95]:
cleaner.fit_transform(df)[0:15]

### 6.2 Exercise 2
- Load `sample_dataset.csv`.
- Replace the missing values in the float variables using KNN with 10 neighbors and distance-based weights.
- Replace the missing values in the categorical variables using the most frequent value.

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer, make_column_selector

In [None]:
df = pd.read_csv("sample_dataset.csv")

In [96]:
cleaner = ColumnTransformer([
    ('float_variables', KNNImputer(n_neighbors=10, weights='distance'), make_column_selector(dtype_include='float')),
    ('categorical', SimpleImputer(strategy='most_frequent'), make_column_selector(dtype_exclude='float'))
])

In [97]:
cleaner.fit_transform(df)

In [98]:
df.dtypes

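`target` has dtype `int`, so the `dtype_include='float'` selector skips it and the KNN imputer never processes it. Because the second selector uses `dtype_exclude='float'`, `target` is handled by the most-frequent imputer together with the categorical columns.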
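To check which columns a dtype-based selector picks up, the selector can be called directly on a DataFrame. The sketch below uses a hypothetical frame with one float, one int, and one object column. `np.number` is shown as one possible way to include integer columns in the numerical group.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy data: float feature, int target, object category.
toy = pd.DataFrame({
    "radius": [1.5, np.nan],
    "target": [0, 1],
    "grade": pd.Series(["A", np.nan], dtype=object),
})

float_only = make_column_selector(dtype_include="float")(toy)
all_numeric = make_column_selector(dtype_include=np.number)(toy)
not_float = make_column_selector(dtype_exclude="float")(toy)

print(float_only)   # float columns only
print(all_numeric)  # float and int columns
print(not_float)    # everything that is not float, including the int column
```

Calling the selectors before building the `ColumnTransformer` is a cheap way to confirm that every column ends up in the intended group.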

# 7. Reflection

## 7.1 Understanding the importance of Data Cleaning:
- Clean data ensures that models learn meaningful patterns instead of **noise or errors**, leading to more accurate predictions and insights.

## 7.2 Handling Missing Values:
- Different strategies for imputing missing values, such as mean, median, and most frequent value replacement, were explored. 
- Each technique has its strengths and suits specific scenarios, depending on the nature of the data. Specifically:
- Use `SimpleImputer` when:
    - You need a quick and straightforward solution.
    - The data is simple and well-behaved, roughly normally distributed (bell-shaped), and the missing values are randomly distributed.
    - There are few missing values relative to the total number of observations.
- Use `KNNImputer` when:
    - The data has complex feature interactions and correlations.
    - The missing values are likely related to other features in the data.
    - There are sufficient computational resources and time for a more sophisticated imputation.
    - Preserving the natural distribution and relationships within the data is crucial.

## 7.3 **ColumnTransformer and Its Application:**
   - Learning about `ColumnTransformer` has been particularly enlightening. It allows for the application of different transformations to different subsets of columns within a single pipeline. This tool is invaluable when dealing with datasets that contain both numerical and categorical variables, enabling streamlined and efficient preprocessing.

## 7.4 **The Role of `fit` and `transform` Methods:**
   - Understanding the distinct roles of the `fit` and `transform` methods helped me grasp how scikit-learn processes data. `fit` is used to learn parameters from the training data, while `transform` applies these learned parameters to transform the data. This separation is crucial when working with training and test sets to avoid data leakage and ensure consistent transformations.

## 7.5 **Differentiating `impute` and `compose`:**
   - The distinction between the `impute` module, which deals specifically with handling missing data, and the `compose` module, which manages the application of multiple transformations across different data types, clarified the structured approach of scikit-learn to preprocessing.

Before deciding whether to use `SimpleImputer` or `KNNImputer` for handling missing values in a dataset, it's important to assess the characteristics of the data and the nature of the missing values. Here's a checklist of key factors to consider:

### 1. **Distribution of Missing Values:**
- **Percentage of Missing Data:** Calculate the percentage of missing values in each column. If the percentage is very high (e.g., > 30%), imputation might not be the best solution; consider removing those columns or using a more sophisticated method.
- **Pattern of Missing Data:** Check whether the data is Missing Completely at Random (MCAR) or follows a pattern: Missing at Random (MAR) or Missing Not at Random (MNAR, also written NMAR). Such patterns influence which imputation method is more appropriate.
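The percentage check can be sketched in a few lines of pandas. The frame below is hypothetical, and the 30% cutoff is only the example threshold mentioned in the checklist, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column, in percent, highest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Columns above the example threshold are candidates for removal or special handling.
high_missing = missing_pct[missing_pct > 30].index.tolist()
print(high_missing)
```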

### 2. **Data Size and Computational Resources:**
- **Dataset Size:** KNN Imputer can be computationally expensive, especially for large datasets. If the dataset is very large, SimpleImputer may be preferable due to its simplicity and speed.
- **Dimensionality:** For high-dimensional data, calculating distances for KNN can be inefficient and slow. Check if the number of features is reasonable for using KNN.

### 3. **Feature Relationships:**
- **Correlation and Interactions:** Check if there are strong correlations or interactions between features. If features are highly correlated, KNN Imputer can leverage these relationships to provide more accurate imputations.
- **Feature Importance:** For critical features where accuracy is paramount, KNN Imputer may be more suitable as it considers relationships between multiple variables.
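One simple way to check for feature relationships is a pairwise correlation matrix. The sketch below uses synthetic data generated with a fixed seed. The column names `x`, `y`, and `z` are hypothetical, with `y` built to depend on `x` and `z` independent of both.

```python
import numpy as np
import pandas as pd

# Synthetic data: y is a noisy linear function of x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

# Absolute Pearson correlations; corr() ignores missing values pairwise.
corr = toy.corr().abs()
print(corr.round(2))
```

A column that correlates strongly with others (like `y` with `x` here) is a good candidate for KNN imputation, while a column with only weak correlations (like `z`) gains little over a mean or median fill.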

### 4. **Data Distribution:**
- **Normality:** For roughly normally distributed data, SimpleImputer with the mean is often sufficient. If the data is skewed or contains outliers, the median is more robust than the mean, and KNN may provide a better estimate when neighboring rows are informative.
- **Outliers:** Check for the presence of outliers. SimpleImputer can be influenced by outliers if using mean, whereas KNN Imputer might provide better estimates by considering neighbors' values.

### 5. **Computational Efficiency:**
- **Time and Resource Constraints:** If there are constraints on computational resources or time, SimpleImputer is much faster and requires less memory compared to KNN Imputer.
- **Modeling Pipeline Complexity:** If you're building a simple pipeline and want quick preprocessing, SimpleImputer can streamline the process without extensive tuning.

### 6. **Nature of the Problem:**
- **Predictive Model Requirements:** Consider the downstream impact on the model. If the model relies heavily on feature relationships, KNN may be a better choice.
- **Data Domain and Context:** In some domains, such as healthcare or finance, preserving data relationships might be crucial for model performance and interpretability, making KNN Imputer more suitable.

### 7. **Exploratory Data Analysis (EDA):**
- **Visual Analysis:** Use visualizations like heatmaps or scatter plots to understand the distribution of missing values and relationships between features.
- **Statistical Tests:** Perform statistical tests to see if the missing data is random or not. This can inform whether a more sophisticated imputation like KNN is necessary.

### Decision Summary:

- **Use SimpleImputer When:**
    - The dataset is large and computational efficiency is a concern.
    - Missing values are randomly distributed and not highly correlated with other features.
    - You need a quick and simple solution for a baseline model or less critical features.

- **Use KNN Imputer When:**
    - The dataset is of moderate size and computational resources are available.
    - Features have strong correlations or dependencies, and you want to preserve these relationships.
    - You have a small-to-moderate percentage of missing values, and maintaining data integrity is crucial for model performance.

By considering these factors, you can make an informed decision on whether to use SimpleImputer or KNN Imputer for your dataset.
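One way to ground the decision in evidence is to hide some known values, impute them with both methods, and compare the error. The following is a minimal sketch on synthetic, strongly correlated data. The setup (300 rows, 30 masked entries, 5 neighbors) is an assumption made for illustration, and results on real data may differ.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic data: the second feature is a noisy linear function of the first.
rng = np.random.default_rng(42)
x = rng.normal(size=300)
full = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=300)])

# Hide 30 known values in the second feature.
X_missing = full.copy()
masked_rows = rng.choice(300, size=30, replace=False)
X_missing[masked_rows, 1] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Mean absolute error against the hidden ground truth.
mean_mae = np.abs(mean_filled[masked_rows, 1] - full[masked_rows, 1]).mean()
knn_mae = np.abs(knn_filled[masked_rows, 1] - full[masked_rows, 1]).mean()

print(f"mean imputation MAE: {mean_mae:.3f}")
print(f"KNN imputation MAE:  {knn_mae:.3f}")
```

On data like this, where one feature almost determines the other, KNN recovers the hidden values far more closely than the column mean. On weakly correlated data the gap shrinks, and the extra cost of KNN may not pay off.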