# Data wrangling, imputation, and cleaning

**Data wrangling, imputation, and cleaning** are related steps in data preprocessing, but each serves a different purpose and relies on different techniques. Here's a breakdown:


## 1. Data Wrangling
**Definition:**  
Data wrangling (also called data munging) is the process of transforming raw data into a usable format for analysis. It covers the structural work of combining, reshaping, and converting data so that analysis tools can consume it.

**Key Activities:**  

- Merging or joining datasets  
- Reshaping data (e.g., pivoting, melting)  
- Filtering and selecting data subsets  
- Converting data types (e.g., strings to dates)  
- Aggregating or summarizing data  
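
Several of these activities can be sketched in a few lines of pandas. The snippet below is a minimal illustration, not a prescribed workflow; the `orders` and `customers` tables and all of their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables: each order references a customer by id.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "order_date": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "amount": [25.0, 40.0, 15.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ann", "Bob"]})

# Join: attach each customer's name to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Type conversion: ISO date strings become datetime values.
merged["order_date"] = pd.to_datetime(merged["order_date"], format="%Y-%m-%d")

# Filter and select: orders from February onward, keeping three columns.
cutoff = pd.Timestamp("2024-02-01")
recent = merged.loc[merged["order_date"] >= cutoff, ["name", "order_date", "amount"]]
```

A left join is used here so that every order is kept even if its customer record were missing; whether that is the right choice depends on the dataset.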

**Examples:**  

- Combining sales data from multiple regions into a single dataset.  
- Restructuring data from wide to long format for easier analysis.
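
Both examples above can be sketched with pandas as follows. The regional tables, product names, and sales figures are invented for illustration only.

```python
import pandas as pd

# Hypothetical regional sales tables in wide format (one column per quarter).
north = pd.DataFrame({"product": ["A", "B"], "Q1": [100, 80], "Q2": [120, 90]})
south = pd.DataFrame({"product": ["A", "B"], "Q1": [70, 60], "Q2": [75, 65]})

# Tag each table with its region, then stack them into a single dataset.
combined = pd.concat(
    [north.assign(region="North"), south.assign(region="South")],
    ignore_index=True,
)

# Reshape from wide to long: one row per (region, product, quarter).
long_sales = combined.melt(
    id_vars=["region", "product"],
    value_vars=["Q1", "Q2"],
    var_name="quarter",
    value_name="sales",
)

# Aggregate: total sales per region.
totals = long_sales.groupby("region")["sales"].sum()
```

In the long format, each observation sits in its own row, which is what makes the final `groupby` a one-liner.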

**Goal:**  
To make the data structured and ready for analysis.


## 2. Imputation
**Definition:**  

Imputation is the process of replacing missing values with estimated or substituted values so that records with gaps can still be used in analysis.

**Key Techniques:**  

- **Mean/Median/Mode Imputation:** Replace missing values with the column's mean, median, or (for categorical data) most frequent value.  
- **Forward/Backward Fill:** Carry the previous or next observed value into each gap, typically in ordered data such as time series.  
- **Interpolation:** Estimate missing values from neighboring observations, for example by linear interpolation.  
- **Model-based Imputation:** Use regression, k-NN, or other machine learning models to predict missing values from the remaining features.
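
The simpler techniques in this list map directly onto pandas methods. The sketch below applies them to a small, made-up series of temperature readings and a made-up categorical column so that the results can be compared side by side.

```python
import pandas as pd

# Hypothetical temperature readings with two gaps (None is stored as NaN).
temps = pd.Series([10.0, None, 14.0, None, 20.0])

# Mean and median imputation: one constant fills every gap.
mean_filled = temps.fillna(temps.mean())
median_filled = temps.fillna(temps.median())

# Forward and backward fill: copy the nearest observed neighbor.
forward_filled = temps.ffill()
backward_filled = temps.bfill()

# Linear interpolation: each gap lies on the line between its neighbors.
interpolated = temps.interpolate()

# Mode imputation for a categorical column.
colors = pd.Series(["red", None, "red", "blue"])
mode_filled = colors.fillna(colors.mode()[0])
```

With a single missing value between two observations, linear interpolation reduces to the average of the two neighbors.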

**Example:**  

- Filling in missing temperatures in a weather dataset using the average of neighboring values.

**Goal:**  

To keep incomplete records usable while introducing as little bias as possible.
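
Model-based imputation can be sketched with an ordinary least-squares line fitted to the complete rows. This is a minimal illustration under simplifying assumptions: the `height` and `weight` arrays are invented, and a single linear predictor stands in for whatever model a real dataset would call for.

```python
import numpy as np

# Hypothetical data: height (cm) is fully observed, weight (kg) has gaps.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 60.0, np.nan, 80.0, np.nan])

# Boolean mask of rows where weight was actually recorded.
observed = ~np.isnan(weight)

# Fit weight = slope * height + intercept on the complete rows only.
slope, intercept = np.polyfit(height[observed], weight[observed], deg=1)

# Predict the missing weights from the fitted line.
imputed = weight.copy()
imputed[~observed] = slope * height[~observed] + intercept
```

Because every imputed point lies exactly on the fitted line, this approach tends to understate the natural spread of the variable, which is worth keeping in mind before using the filled column for variance-sensitive analysis.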


## 3. Data Cleaning
**Definition:** 

Data cleaning is the process of detecting and correcting (or removing) inaccurate, inconsistent, or irrelevant data.

**Key Activities:**  

- Removing duplicates  
- Handling outliers  
- Correcting typos and errors in data entries  
- Standardizing formats (e.g., date formats, categorical values)  
- Addressing inconsistencies (e.g., "NY" vs. "New York")
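
Handling outliers starts with detecting them. One common convention, used here purely as an illustration, is the interquartile-range (IQR) rule with a multiplier of 1.5; the sample values are made up, and the right threshold depends on the data.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 95])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: points beyond 1.5 * IQR from the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# One option is to drop flagged points; capping or reviewing them are others.
without_outliers = values[~is_outlier]
```

Flagging is kept separate from removal on purpose: an extreme value may be a data-entry error or a genuine observation, and only the former should be discarded.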

**Examples:**  

- Removing duplicate entries in a customer database.  
- Converting all date formats to a consistent standard (e.g., YYYY-MM-DD).
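
A minimal pandas sketch of these cleaning steps follows. The customer records are invented, the `state_map` lookup is a hypothetical mapping that would need to be extended for real data, and the input dates are assumed to be in day/month/year order.

```python
import pandas as pd

# Hypothetical customer records with a duplicate row and inconsistent entries.
customers = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Ray", "Cy Dunn"],
    "state": ["NY", "NY", "New York ", "ca"],
    "signup": ["03/01/2024", "03/01/2024", "15/02/2024", "20/02/2024"],
})

# Remove exact duplicate rows.
cleaned = customers.drop_duplicates().reset_index(drop=True)

# Standardize categories: trim whitespace, unify case, map known variants.
state_map = {"NEW YORK": "NY"}  # illustrative only
normalized = cleaned["state"].str.strip().str.upper()
cleaned["state"] = normalized.replace(state_map)

# Parse day/month/year strings and re-emit them as YYYY-MM-DD.
parsed = pd.to_datetime(cleaned["signup"], format="%d/%m/%Y")
cleaned["signup"] = parsed.dt.strftime("%Y-%m-%d")
```

Passing an explicit `format` to `pd.to_datetime` avoids ambiguity between day-first and month-first dates such as `03/01/2024`.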

**Goal:**  

To ensure the dataset is accurate, consistent, and free from errors.


## Summary Table

| Aspect           | Data Wrangling                          | Imputation                              | Data Cleaning                            |
|------------------|------------------------------------------|-----------------------------------------|------------------------------------------|
| **Focus**        | Organizing and restructuring data        | Filling in missing values               | Correcting errors and inconsistencies    |
| **Main Tasks**   | Merging, reshaping, filtering            | Estimating missing data                 | Removing duplicates, fixing typos        |
| **Techniques**   | Pivoting, merging, transforming          | Mean, median, model-based imputation    | Deduplication, outlier removal           |
| **Goal**         | Prepare data for analysis                | Handle missing data                     | Ensure data accuracy and consistency     |


These steps often overlap in practice. Standardizing date formats, for example, is both a cleaning task and a type conversion of the kind done during wrangling, and real data preparation usually combines all three to produce high-quality data for analysis or modeling.