<a href="https://colab.research.google.com/github/bountyhunter12/Shared_Task/blob/main/missing_value_handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [63]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [64]:

data = pd.read_csv('/content/Income - Sheet1.csv')

In [65]:
# Step 1: Check for missing values
print("Missing values before replacement:")
print(data.isnull().sum())

Missing values before replacement:
Age                  1
Marital status       2
Income               1
Gender               1
Expense_per_month    1
dtype: int64
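As a small illustration of the check above, the sketch below reports both the count and the percentage of missing entries per column. The `toy` frame and its values are hypothetical stand-ins, not rows from the income sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the income sheet.
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Gender": ["F", "M", None, "F"],
})

# isnull() gives a boolean frame: sum() counts the missing cells,
# mean() gives the fraction of missing cells per column.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing_summary)
```

The percentage makes it easier to judge whether imputation is reasonable or whether a column is too sparse to keep.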


In [66]:
# Step 2: Replace placeholders with NaN
# Note: 0 is in the placeholder list, so a zero in any column is treated as missing.
missing_placeholders = ["nan", "unknown", 0, "n/a", "Null"]
data.replace(missing_placeholders, np.nan, inplace=True)

In [67]:
# Check for missing values again
print("Missing values after replacing placeholders:")
print(data.isnull().sum())

Missing values after replacing placeholders:
Age                  2
Marital status       3
Income               2
Gender               1
Expense_per_month    1
dtype: int64
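Because the placeholder list is applied to the whole frame, a legitimate zero in any column would also become NaN. One possible refinement, sketched here on hypothetical data (the `Dependents` column is invented for illustration), is to apply the text placeholders everywhere but treat 0 as missing only in the columns where it cannot be a real value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Dependents" is an invented column where 0 is valid.
toy = pd.DataFrame({
    "Age": [30, 0, 45],
    "Income": ["5000", "unknown", "7000"],
    "Dependents": [0, 2, 0],
})

# Text placeholders are replaced in every column.
text_placeholders = ["nan", "unknown", "n/a", "Null"]
cleaned = toy.replace(text_placeholders, np.nan)

# 0 is treated as missing only for Age, where it cannot be a real value.
cleaned["Age"] = cleaned["Age"].replace(0, np.nan)

print(cleaned.isnull().sum())
```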


In [68]:
# 1. Handle missing and invalid values for `Age`
# Reason: Ages that are zero, negative, or missing are replaced with the mean of the valid (positive) ages, so every row keeps a usable numeric value.
age = pd.to_numeric(data['Age'], errors='coerce')
mean_age = age[age > 0].mean()
# Keep positive ages; everything else (including NaN) becomes the mean.
data['Age'] = age.where(age > 0, mean_age)
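To make the effect of this step concrete, here is a minimal sketch on a hypothetical series of ages. It computes the mean from the positive values only and then substitutes it for every non-positive or missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one negative value and one missing value.
ages = pd.Series([25.0, -3.0, np.nan, 35.0])

# Mean of the valid entries only (25 and 35).
valid_mean = ages[ages > 0].mean()

# where() keeps entries that satisfy the condition and replaces the rest.
# NaN > 0 evaluates to False, so missing values are replaced as well.
imputed = ages.where(ages > 0, valid_mean)
print(imputed.tolist())  # [25.0, 30.0, 30.0, 35.0]
```

Computing the mean before substituting matters: including the invalid -3 would pull the imputed value down.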


In [69]:
# 2. Handle missing and invalid values for `Income` and `Expense_per_month`
# Reason: Coercing to numeric turns any remaining non-numeric entry (such as "unknown") into NaN. Imputing with the column mean keeps every row and leaves the column mean unchanged.
data['Income'] = pd.to_numeric(data['Income'], errors='coerce')
data['Expense_per_month'] = pd.to_numeric(data['Expense_per_month'], errors='coerce')
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection, which pandas warns will stop working in pandas 3.0.
data['Income'] = data['Income'].fillna(data['Income'].mean())
data['Expense_per_month'] = data['Expense_per_month'].fillna(data['Expense_per_month'].mean())
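The pandas message for this pattern suggests either assigning the filled column back or calling `fillna` on the frame with a column-to-value mapping. The sketch below shows the second form on a hypothetical two-column frame; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric entry and one missing entry.
toy = pd.DataFrame({
    "Income": ["1000", "unknown", "3000"],
    "Expense_per_month": [200.0, None, 400.0],
})

# Non-numeric strings become NaN when coerced.
toy["Income"] = pd.to_numeric(toy["Income"], errors="coerce")

# Fill each column with its own mean in one call on the frame,
# which avoids the chained inplace call on a column selection.
toy = toy.fillna({
    "Income": toy["Income"].mean(),
    "Expense_per_month": toy["Expense_per_month"].mean(),
})
print(toy)
```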


In [70]:
# 3. Handle missing values for categorical columns
# Reason: Filling missing entries with the mode (the most frequent category) keeps every row without introducing a new category.
data['Marital status'] = data['Marital status'].replace('Null', pd.NA)
# Assign the result back rather than using inplace=True on a column selection.
data['Marital status'] = data['Marital status'].fillna(data['Marital status'].mode()[0])
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
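A minimal sketch of mode imputation on a hypothetical categorical series follows. Note that `mode()` ignores missing values by default and returns its results sorted, so when two categories tie, `[0]` picks the one that sorts first.

```python
import pandas as pd

# Hypothetical marital-status values with one missing entry.
status = pd.Series(["Married", "Single", None, "Married"])

# mode() returns a Series of the most frequent value(s); take the first.
most_common = status.mode()[0]

# Assigning the result to a new name keeps the original series unchanged.
filled = status.fillna(most_common)
print(filled.tolist())  # ['Married', 'Single', 'Married', 'Married']
```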


In [71]:
# 4. Encode categorical variables
# Reason: One-hot encoding turns each category into an indicator column that machine learning models can consume. drop_first=True drops one level per variable, since it is implied by the remaining columns, which avoids perfectly collinear features.
data = pd.get_dummies(data, columns=['Marital status', 'Gender'], drop_first=True)
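To show what `drop_first=True` does, the sketch below encodes a hypothetical two-level `Gender` column. The first level in sorted order (`Female`) is dropped, so a single indicator column remains and `Female` is represented by a false/zero value.

```python
import pandas as pd

# Hypothetical toy column with two categories.
toy = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# With drop_first=True only k-1 indicator columns are kept per variable.
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True)
print(encoded.columns.tolist())  # ['Gender_Male']
```

The indicator dtype depends on the pandas version (boolean in recent releases), so cast with `.astype(int)` if 0/1 integers are needed downstream.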

In [72]:
# 5. Standardize numerical features
# Reason: Standardization (z-score scaling) rescales each numerical feature to zero mean and unit variance, so features with large magnitudes (such as Income) do not dominate scale-sensitive models.
scaler = StandardScaler()
numerical_columns = ['Age', 'Income', 'Expense_per_month']
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])
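As a sanity check on what the scaler computes, this sketch compares `StandardScaler` with the z-score formula applied by hand, z = (x - mean) / std, on a small hypothetical array. It assumes the default scaler settings, which use the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: an age-like column and an income-like column.
X = np.array([[20.0, 1000.0],
              [30.0, 2000.0],
              [40.0, 3000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score per column; NumPy's std() defaults to ddof=0,
# matching the scaler's population standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics.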

In [73]:
# Save the transformed data

data.to_csv('/content/Transformed_Income.csv', index=False)
