# Handling Missing Data

This section covers three ways to handle missing data, a common problem in real datasets: detecting it, removing it, and imputing it. Values can be missing for many reasons, and many machine learning models cannot work with them directly, so the gaps should be addressed before training.

---

## Table of Contents

1. [Identifying Missing Data](#1-identifying-missing-data)
2. [Removing Missing Data](#2-removing-missing-data)
3. [Imputing Missing Data](#3-imputing-missing-data)
4. [Advanced Imputation Techniques](#4-advanced-imputation-techniques)

---

## 1. Identifying Missing Data

Let's start by identifying missing data in the dataset.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (replace 'data.csv' with your file)
data = pd.read_csv('data.csv')

# Check for missing values in the dataset
missing_data = data.isnull().sum()
# sep="\n" puts the label on its own line without indenting the first row
print("Missing values in each column:", missing_data, sep="\n")

Missing values in each column:
Name      0
Age       1
City      0
Date      0
Salary    1
dtype: int64
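
Raw counts get harder to read as a table grows, so it can help to also report the share of missing values per column and to pull out the affected rows. The following is a minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical stand-in for `data.csv`) whose values mirror the outputs shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv, built to match the counts shown above
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, np.nan, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
    "Date": ["2023-01-01", "2023-02-15", "2023-03-22", "2023-04-05", "2023-05-10"],
    "Salary": [70000, 80000, 50000, 120000, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps)
```

Seeing which rows are affected, and not only how many, helps when choosing between dropping and imputing.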


## 2. Removing Missing Data

Dropping missing data is a straightforward approach but can result in the loss of valuable information.

In [2]:
# Drop every row that contains at least one missing value
cleaned_rows = data.dropna()
print(f"Data shape after dropping rows with missing values: {cleaned_rows.shape}")

Data shape after dropping rows with missing values: (3, 5)


In [3]:
# Drop every column that contains at least one missing value
cleaned_cols = data.dropna(axis=1)
print(f"Data shape after dropping columns with missing values: {cleaned_cols.shape}")

Data shape after dropping columns with missing values: (5, 3)
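
Dropping every incomplete row or column is often more aggressive than needed. As a sketch of two gentler options, `dropna` also accepts `subset` (only look at the named columns) and `thresh` (keep rows with at least that many non-missing values). The frame `sample` below is a hypothetical stand-in for `data.csv`, using the values shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv (values taken from the outputs in this section)
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, np.nan, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
    "Date": ["2023-01-01", "2023-02-15", "2023-03-22", "2023-04-05", "2023-05-10"],
    "Salary": [70000, 80000, 50000, 120000, np.nan],
})

# Drop a row only when "Age" is missing; a missing "Salary" is tolerated
age_known = sample.dropna(subset=["Age"])
print(age_known.shape)  # (4, 5)

# Keep rows that have at least 4 non-missing values out of 5 columns
mostly_complete = sample.dropna(thresh=4)
print(mostly_complete.shape)  # (5, 5): every row has at most one gap
```

Which threshold is appropriate depends on how many columns the downstream analysis actually needs.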


## 3. Imputing Missing Data

Imputation helps maintain the dataset size, but the choice of method depends on the nature of the data.

In [5]:
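# Fill missing ages with the column mean; assign the result back instead of
# calling fillna(inplace=True) on a column selection, which newer pandas versions warn about
data["Age"] = data["Age"].fillna(data["Age"].mean())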
data.head()

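|   | Name | Age | City | Date | Salary |
|---|------|-----|------|------|--------|
| 0 | Alice | 24.0 | New York | 2023-01-01 | 70000.0 |
| 1 | Bob | 27.0 | Los Angeles | 2023-02-15 | 80000.0 |
| 2 | Charlie | 22.0 | Chicago | 2023-03-22 | 50000.0 |
| 3 | David | 25.5 | Houston | 2023-04-05 | 120000.0 |
| 4 | Eve | 29.0 | Phoenix | 2023-05-10 | NaN |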
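
To illustrate how the choice of statistic depends on the data, the sketch below fills each column with a different value in one call by passing a dictionary to `fillna`. It assumes the mean is acceptable for `Age` and that the median is preferred for `Salary` because one salary is much higher than the rest; `sample` is a hypothetical stand-in containing only the two numeric columns from this section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the numeric columns from this section
sample = pd.DataFrame({
    "Age": [24, 27, 22, np.nan, 29],
    "Salary": [70000, 80000, 50000, 120000, np.nan],
})

# Choose one fill value per column: mean for Age, median for Salary
fill_values = {
    "Age": sample["Age"].mean(),
    "Salary": sample["Salary"].median(),
}

# fillna returns a new frame; the original `sample` is left unchanged
filled = sample.fillna(fill_values)
print(filled)

# Compare the two candidate statistics for Salary
print(sample["Salary"].mean(), sample["Salary"].median())
```

Here the salary mean and median differ because of the single high value, which is the kind of situation where the median is usually the safer default.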


## 4. Advanced Imputation Techniques

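For more control, scikit-learn's `SimpleImputer` supports several imputation strategies (`mean`, `median`, `most_frequent`, and `constant`) and learns its fill values in a separate `fit` step so they can be reused on new data.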

In [6]:
from sklearn.impute import SimpleImputer
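# The median is robust to outliers; for categorical columns, use strategy="most_frequent" instead
imputer = SimpleImputer(strategy="median")
# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it before assigning to the column
data["Salary"] = imputer.fit_transform(data[["Salary"]]).ravel()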
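
`SimpleImputer` can also process several columns at once and exposes the values it learned. The sketch below assumes a hypothetical frame `sample` holding the two numeric columns from this section; it fits a median imputer on both, inspects `statistics_`, and wraps the resulting NumPy array back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in with the numeric columns from this section
sample = pd.DataFrame({
    "Age": [24, 27, 22, np.nan, 29],
    "Salary": [70000, 80000, 50000, 120000, np.nan],
})

imputer = SimpleImputer(strategy="median")

# fit learns one median per column; transform fills the gaps with them
imputed = imputer.fit_transform(sample)

# Learned fill values, in the same order as the input columns
print(imputer.statistics_)

# fit_transform returns a plain NumPy array here, so restore the labels
filled = pd.DataFrame(imputed, columns=sample.columns, index=sample.index)
print(filled)
```

Because the medians are stored on the fitted object, calling `imputer.transform` on new rows reuses the same fill values instead of recomputing them.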