# <h1 style='text-align:center'>*EDA & Data Cleaning* </h1>


---
---

## <h1 style='text-align:center'> Exploratory Data Analysis (EDA) </h1>

### 🔹 What is EDA?

**Exploratory Data Analysis (EDA)** is the process of **examining, summarizing, and visualizing data** to understand its main characteristics, patterns, relationships, and anomalies before applying formal modeling techniques.

👉 The term was introduced by **John Tukey**.

---

## 🔹 Objectives of EDA

* Understand data structure and content
* Identify missing values and outliers
* Detect patterns and trends
* Discover relationships between variables
* Check assumptions for statistical models
* Guide data cleaning and feature selection

---

## 🔹 Types of EDA

### 1️⃣ Univariate Analysis

Analysis of **a single variable**

**Techniques**

* Mean, median, mode
* Histogram
* Box plot
* Frequency table
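
The numeric techniques above can be sketched with pandas. This is a minimal illustration on a made-up `ages` series (the data and variable names are assumptions, not part of any real dataset); a histogram or box plot would show the same distribution visually.

```python
import pandas as pd

# Hypothetical sample: ages of eight survey respondents.
ages = pd.Series([22, 25, 25, 28, 31, 35, 35, 35], name="age")

mean_age = ages.mean()          # arithmetic average
median_age = ages.median()      # middle value of the sorted data
mode_age = ages.mode().iloc[0]  # most frequent value (first one if tied)

# Frequency table: how often each distinct value occurs.
freq_table = ages.value_counts().sort_index()

print(mean_age, median_age, mode_age)
print(freq_table)
```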

---

### 2️⃣ Bivariate Analysis

Analysis of **two variables**

| Variable Types             | Methods                   |
| -------------------------- | ------------------------- |
| Numerical vs Numerical     | Scatter plot, correlation |
| Categorical vs Numerical   | Box plot                  |
| Categorical vs Categorical | Crosstab, bar chart       |
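
One way to compute the non-graphical side of each row in the table, assuming a small made-up DataFrame (all column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical study data: hours studied, exam score, class group, pass flag.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 73],
    "group": ["A", "A", "B", "B", "A", "B"],
    "passed": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical vs numerical: Pearson correlation coefficient.
corr = df["hours"].corr(df["score"])

# Categorical vs numerical: median of the numeric column per category
# (the same comparison a box plot shows visually).
score_by_group = df.groupby("group")["score"].median()

# Categorical vs categorical: contingency table of counts.
table = pd.crosstab(df["group"], df["passed"])

print(round(corr, 3))
print(score_by_group)
print(table)
```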

---

### 3️⃣ Multivariate Analysis

Analysis of **more than two variables**

**Techniques**

* Pair plots
* Correlation heatmaps
* Multivariate scatter plots

---

## 🔹 Statistical Measures Used in EDA

### Central Tendency

* Mean
* Median
* Mode

### Dispersion

* Range
* Variance
* Standard deviation
* Interquartile Range (IQR)
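
A short sketch of the dispersion measures on a made-up series. Note that pandas computes the *sample* variance and standard deviation by default (`ddof=1`), and `quantile` uses linear interpolation; other tools may use different conventions.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
values = pd.Series([4, 8, 15, 16, 23, 42])

value_range = values.max() - values.min()
variance = values.var()   # sample variance (ddof=1 by default in pandas)
std_dev = values.std()    # sample standard deviation
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1             # spread of the middle 50% of the data

print(value_range, variance, round(std_dev, 2), iqr)
```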

---

## 🔹 Visualization Techniques

| Visualization | Purpose              |
| ------------- | -------------------- |
| Histogram     | Data distribution    |
| Box plot      | Spread and outliers  |
| Scatter plot  | Relationship         |
| Bar chart     | Category comparison  |
| Heatmap       | Correlation strength |

---

## 🔹 EDA Workflow (Theory)

1. Understand the problem
2. Inspect data structure
3. Handle missing values
4. Remove duplicates
5. Correct data types
6. Identify inconsistencies
7. Detect outliers
8. Analyze relationships
9. Summarize insights

---

## 🔹 EDA in Machine Learning

* Informs feature selection
* Reveals multicollinearity between features
* Suggests which transformations to apply (e.g., log scaling for skewed features)
* Supports better model performance

---

## 🔹 Difference Between EDA & Data Cleaning

| EDA                 | Data Cleaning       |
| ------------------- | ------------------- |
| Explores patterns   | Fixes errors        |
| Analytical          | Technical           |
| Visualization-based | Preprocessing-based |
| Guides modeling     | Prepares data       |

---

## 🔹 Advantages of EDA

* Better understanding of data
* Fewer modeling errors
* Improved decision-making
* High-quality insights


---
---

## <h1 style='text-align:center'> Data Cleaning </h1>

### 🔹 What is Data Cleaning?

**Data Cleaning** is the process of identifying and correcting inaccurate, incomplete, inconsistent, or irrelevant data to improve data quality.

🎯 **Objectives:**

* Improve accuracy and reliability
* Reduce noise and errors
* Prepare data for analysis and modeling

---

## 🔹 Types of Data Quality Issues

1. **Missing values**
2. **Duplicate records**
3. **Incorrect data types**
4. **Inconsistent data**
5. **Outliers**
6. **Invalid data**

---


## 1. Missing Values

### 🔹 What are Missing Values?

**Missing values** occur when no data is stored for a variable in an observation. They are usually represented as **NaN, NULL, blank, or NA**.

---

### 🔹 Causes of Missing Values

* Data not recorded
* Human error during data entry
* Equipment or system failure
* Respondent skipped a question
* Data corruption during transfer

---

### 1.1 Types of Missing Data (Very Important)

#### 1️⃣ MCAR: Missing Completely At Random

* Missingness has **no relationship** with any variable
* Example: Sensor randomly fails

👉 Least harmful type

---

#### 2️⃣ MAR: Missing At Random

* Missingness depends on **other observed variables**
* Example: Income missing more often for younger people

👉 Can be handled using imputation

---

#### 3️⃣ MNAR: Missing Not At Random

* Missingness depends on the **missing value itself**
* Example: People with high income avoid answering

👉 Most difficult to handle

---

### 1.2 Effects of Missing Values

* Bias in analysis
* Reduced model accuracy
* Incorrect conclusions
* Loss of information

---

## 1.3 Methods to Handle Missing Values

### 1️⃣ Deletion Methods

**a) Listwise deletion (Row removal)**

* Remove rows with missing values
  * ✔ Simple
  * ❌ Data loss

**b) Column removal**

* Drop columns with too many missing values

---

### 2️⃣ Imputation Methods

#### Numerical Data

* **Mean**: when the data is roughly symmetric
* **Median**: when the data is skewed or has outliers (preferred)
* **Mode**: for discrete, categorical-like numbers (e.g., ratings or codes)

#### Categorical Data

* Mode
* Create a new category: `"Unknown"`
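
A minimal sketch of these simple imputation options, assuming a made-up DataFrame with a `salary` and a `department` column (both names are hypothetical). New columns are created so the original values stay available for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [30000, 32000, np.nan, 35000, 500000],  # one extreme value
    "department": ["HR", "IT", "IT", None, "IT"],
})

# Numerical: the median is less affected by the 500000 outlier than the mean.
salary_median = df["salary"].median()
df["salary_filled"] = df["salary"].fillna(salary_median)

# Categorical, option 1: fill with the most frequent label.
dept_mode = df["department"].mode().iloc[0]
df["department_mode"] = df["department"].fillna(dept_mode)

# Categorical, option 2: keep missingness visible as its own category.
df["department_unknown"] = df["department"].fillna("Unknown")

print(df)
```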

---

### 3️⃣ Advanced Imputation

* Forward fill / Backward fill (time-series)
* KNN imputation
* Regression imputation
* Multiple imputation
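
Of the methods listed, forward and backward fill are the simplest to show. The sketch below uses a made-up daily series of readings; the other methods (KNN, regression, multiple imputation) need additional libraries and are not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps.
readings = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()   # carry the last observed value forward
backward = readings.bfill()  # pull the next observed value backward

print(forward)
print(backward)
```

Backward fill leaves the final gap unfilled because no later observation exists, so a trailing or leading gap may still need a separate decision.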

---

### 🔹 Choosing the Right Method

| Situation       | Recommended Method      |
| --------------- | ----------------------- |
| Small % missing | Mean / Median           |
| Many outliers   | Median                  |
| Categorical     | Mode / New category     |
| Time-series     | Forward / Backward fill |
| MNAR            | Domain-based decision   |

---

### 1.4 Missing Values in EDA

During EDA:

* Count missing values
* Visualize missing patterns
* Decide treatment strategy
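
Counting missing values is usually a one-liner in pandas. A minimal sketch on a made-up DataFrame (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "income": [50000, 62000, np.nan, 58000],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai"],
})

missing_count = df.isna().sum()             # missing values per column
missing_percent = df.isna().mean() * 100    # share of missing values, in %
rows_with_missing = df.isna().any(axis=1).sum()  # rows with at least one gap

print(missing_count)
print(missing_percent)
print(rows_with_missing)
```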

---

### 1.5 Best Practices

* Never ignore missing values
* Don’t blindly impute with the mean
* Understand the reason for missingness
* Document your approach

---

## 2. Duplicate Records

### 🔹 What are Duplicate Records?

**Duplicate records** are repeated rows or entries in a dataset that represent the **same observation** more than once.

---

### 🔹 Causes of Duplicate Records

* Multiple data entry of the same record
* Data collected from different sources
* Errors during data merging or joins
* System or synchronization issues

---

### 2.1 Types of Duplicates

#### 1️⃣ Exact Duplicates

* All column values are identical
  **Example:** Same customer record repeated twice

---

#### 2️⃣ Partial / Near Duplicates

* Some fields match, others differ slightly

**Examples:**

* “Rahul Kumar” vs “R. Kumar”
* Same phone number, different spelling of name

---

### 2.3 Problems Caused by Duplicates

* Inflated counts and totals
* Biased analysis and incorrect statistics
* Reduced model performance
* Misleading insights

---

## 2.4 Detecting Duplicate Records

### During EDA

* Check repeated rows
* Look for repeated IDs, emails, phone numbers

### Common Indicators

* Same unique identifier repeated
* Unusual increase in record count
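
Both checks can be sketched with `DataFrame.duplicated`. The customer table below is made up, and treating `customer_id` as the unique identifier is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com", "c2@x.com"],
})

# Exact duplicates: every column matches an earlier row.
exact_dupes = df.duplicated()

# Repeated identifier: same customer_id, even if other fields differ.
# keep=False marks every row in a repeated group, not just the later ones.
repeated_ids = df[df.duplicated(subset="customer_id", keep=False)]

print(exact_dupes.sum())
print(repeated_ids)
```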

---

## 2.5 Handling Duplicate Records

### 1️⃣ Remove Exact Duplicates

* Keep the first or last occurrence
* Remove the other repeated rows

✔ Simple and effective
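
In pandas this is typically done with `drop_duplicates`. A minimal sketch on a made-up orders table:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [250, 400, 400, 150],
})

# Keep one copy of each repeated row and drop the rest.
keep_first = df.drop_duplicates(keep="first")
keep_last = df.drop_duplicates(keep="last")

print(keep_first)
print(keep_last)
```

For exact duplicates the surviving values are identical either way; the choice only affects which row index is retained.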

---

### 2️⃣ Handle Partial Duplicates

* Identify key columns (ID, email, phone)
* Standardize text before comparison
* Merge or consolidate records

✔ Requires domain knowledge
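
A rough sketch of the first two steps, assuming that email is a reliable key for this made-up table (that assumption is exactly the kind of domain decision mentioned above). The helper column `email_key` is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rahul Kumar", "R. Kumar", "Asha Rao"],
    "email": ["Rahul@Example.com ", "rahul@example.com", "asha@example.com"],
})

# Standardize the key column before comparing: trim spaces, lowercase.
df["email_key"] = df["email"].str.strip().str.lower()

# Rows sharing the same standardized key are treated as one record.
deduped = df.drop_duplicates(subset="email_key", keep="first")

print(deduped)
```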

---

### 3️⃣ Aggregation

* Combine duplicate records using:

  * Sum (sales)
  * Mean (ratings)
  * Max/Min (latest value)
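
One way to express this in pandas is a `groupby` with named aggregations. The table and column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "sales": [100, 50, 200, 25],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "updated": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-01-02"]
    ),
})

# Collapse repeated customers into one row each.
merged = df.groupby("customer_id").agg(
    total_sales=("sales", "sum"),     # sum for additive quantities
    avg_rating=("rating", "mean"),    # mean for scores
    last_updated=("updated", "max"),  # max to keep the latest timestamp
)

print(merged)
```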

---

## 2.6 Best Practices

* Always check duplicates before analysis
* Define what makes a record “unique”
* Be careful with partial duplicates
* Document removal or merging rules

---

## 2.7 Duplicate Records in Data Cleaning vs EDA

* **Data Cleaning:** Identify and remove duplicates
* **EDA:** Understand how duplicates affect trends and counts

---


## 3. Data Type Correction

### 🔹 What is Data Type Correction?

**Data type correction** is the process of converting data into its **appropriate and consistent format** so it can be correctly analyzed and processed.

---

### 🔹 Why Data Type Correction is Important

* Ensures correct calculations
* Prevents analysis and modeling errors
* Reduces memory usage
* Enables proper sorting, filtering, and visualization

---

### 3.1 Common Data Types in Data Analysis

* **Numerical**: int, float
* **Categorical**: string, category
* **Date/Time**: date, datetime
* **Boolean**: True / False

---

### 3.2 Common Data Type Issues

| Issue                       | Example                    |
| --------------------------- | -------------------------- |
| Numbers stored as text      | `"1000"` instead of `1000` |
| Dates as strings            | `"12-01-2024"`             |
| Mixed data types            | `25`, `"twenty"`           |
| Incorrect category encoding | `"M"`, `"Male"`, `"male"`  |

---

## 3.2.1 Causes of Data Type Errors

* Manual data entry
* Importing data from CSV/Excel
* Different data sources
* Formatting inconsistencies

---

## 3.3 Data Type Correction Techniques

### 1️⃣ Numerical Conversion

* Convert strings to integers or floats
* Handle non-numeric values before conversion

✔ Enables arithmetic operations
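
A minimal sketch using `pd.to_numeric` on a made-up column of text values. With `errors="coerce"`, values that cannot be parsed become `NaN` instead of raising an error, which makes them easy to list and review.

```python
import pandas as pd

raw = pd.Series(["1000", "250", "twenty", " 75 "])

# Strip stray whitespace, then convert; unparseable values become NaN.
numeric = pd.to_numeric(raw.str.strip(), errors="coerce")

# Inspect what failed to convert before deciding how to handle it.
unparsed = raw[numeric.isna()]

print(numeric)
print(unparsed.tolist())
```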

---

### 2️⃣ Categorical Conversion

* Convert text data to categorical type
* Standardize labels

✔ Improves memory efficiency and consistency
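
A short sketch of both steps in pandas. The `label_map` dictionary is an illustrative assumption about which labels mean the same thing; labels are standardized first so that the resulting category type has one entry per real category.

```python
import pandas as pd

gender = pd.Series(["M", "Male", "male", "F", "female"])

# Hypothetical mapping from lowercase variants to one standard label.
label_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
standardized = gender.str.lower().map(label_map)

# Store the cleaned labels using the pandas categorical dtype.
as_category = standardized.astype("category")

print(as_category)
```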

---

### 3️⃣ Date & Time Conversion

* Convert strings to date/datetime format
* Standardize date formats

✔ Enables time-based analysis
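
A minimal sketch with `pd.to_datetime`. Passing an explicit `format` avoids day/month ambiguity; the day-first order used here is an assumption about how the made-up source data was written.

```python
import pandas as pd

raw_dates = pd.Series(["12-01-2024", "25-03-2024", "not a date"])

# "12-01-2024" is read as 12 January 2024 under the assumed day-first format.
# errors="coerce" turns unparseable entries into NaT (missing datetime).
dates = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

print(dates)
print(dates.dt.month)
```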

---

### 4️⃣ Boolean Conversion

* Convert “Yes/No”, “0/1” to True/False

✔ Improves logical operations
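
A sketch of both conversions on made-up data. The `bool_map` dictionary is a hypothetical mapping; values it does not cover are left missing rather than guessed.

```python
import pandas as pd

answers = pd.Series(["Yes", "no", "YES", "No", "maybe"])

# Normalize case and whitespace, then map text labels to booleans.
bool_map = {"yes": True, "no": False}
flags = answers.str.strip().str.lower().map(bool_map)

# Integer 0/1 codes can be cast directly.
codes = pd.Series([0, 1, 1, 0])
code_flags = codes.astype(bool)

print(flags)
print(code_flags)
```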

---

## 3.4 Data Type Correction in EDA

During EDA:

* Inspect column data types
* Identify mismatches
* Convert before visualization or modeling

---

## 3.5 Risks of Ignoring Data Type Correction

* Incorrect aggregation results
* Sorting errors (e.g., `"100"` < `"20"`)
* Model failure or poor performance
* Misleading visualizations

---

## 3.6 Best Practices

* Always check data types after loading data
* Fix types before analysis
* Validate after conversion
* Keep raw data unchanged

---

## 4. Data Inconsistency

### 🔹 What is Data Inconsistency?

**Data inconsistency** occurs when the **same data is represented in different formats, values, or structures** within a dataset, leading to confusion and inaccurate analysis.

---

### 🔹 Examples of Data Inconsistency

* `"Male"`, `"male"`, `"M"`
* `"USA"`, `"U.S.A"`, `"United States"`
* Dates: `01-02-2024`, `2024/02/01`
* Extra spaces: `"Delhi "` vs `"Delhi"`

---

### 🔹 Causes of Data Inconsistency

* Multiple data sources
* Manual data entry
* Lack of data standards
* Different regional or system formats
* Case sensitivity issues

---

## 4.1 Types of Data Inconsistency

### 1️⃣ Formatting Inconsistency

* Different date, number, or text formats

---

### 2️⃣ Value Inconsistency

* Different values for the same category

---

### 3️⃣ Structural Inconsistency

* Same data stored in different columns or units
  Example: Height in cm vs meters

---

### 4️⃣ Logical Inconsistency

* Values contradict each other
  Example: Age = 5, Education = “Graduate”

---

## 4.2 Problems Caused by Data Inconsistency

* Incorrect grouping and aggregation
* Misleading EDA results
* Poor model performance
* Decision-making errors

---

## 4.3 Handling Data Inconsistency

### 1️⃣ Standardization

* Convert text to lowercase/uppercase
* Trim extra spaces
* Use consistent formats
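
The first two steps map directly onto pandas string methods. A minimal sketch on a made-up city column:

```python
import pandas as pd

cities = pd.Series(["Delhi ", "delhi", " DELHI", "Mumbai"])

# Trim surrounding spaces and lowercase so equal values compare equal.
clean = cities.str.strip().str.lower()

print(cities.nunique(), clean.nunique())
```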

---

### 2️⃣ Value Mapping

* Replace multiple labels with a single standard value
  Example: `"M"`, `"male"` → `"Male"`
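
A sketch of value mapping with `Series.replace`, using the country labels from the earlier examples. The `country_map` dictionary is a hypothetical mapping that would normally come from a reviewed reference list.

```python
import pandas as pd

country = pd.Series(["USA", "U.S.A", "United States", "India"])

# Map each known variant to the chosen standard label.
country_map = {"U.S.A": "USA", "United States": "USA"}
standard = country.replace(country_map)

print(standard.value_counts())
```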

---

### 3️⃣ Format Conversion

* Standardize date, currency, and units

---

### 4️⃣ Validation Rules

* Apply logical constraints
* Use reference tables or dictionaries
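
Validation rules can be written as boolean masks that flag rows for review. In the sketch below, the age bounds, the minimum graduate age, and the allowed education labels are all assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [5, 24, 30, -3],
    "education": ["Graduate", "Graduate", "School", "School"],
})

# Rule 1: age must fall in a plausible range (bounds are assumptions).
bad_age = ~df["age"].between(0, 120)

# Rule 2: a graduate is assumed to be at least 18 years old.
bad_education = (df["education"] == "Graduate") & (df["age"] < 18)

# Rule 3: education must come from a reference set of allowed labels.
allowed = {"School", "Graduate", "Postgraduate"}
bad_label = ~df["education"].isin(allowed)

# Rows breaking any rule are collected for manual review.
violations = df[bad_age | bad_education | bad_label]

print(violations)
```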

---

## 4.4 Data Inconsistency in Data Cleaning vs EDA

* **Data Cleaning:** Fix inconsistencies
* **EDA:** Detect patterns of inconsistency

---

## 4.5 Best Practices

* Define data standards early
* Use controlled vocabularies
* Validate after cleaning
* Document changes

---


## 5. Outliers

### 🔹 What are Outliers?

**Outliers** are data points that **significantly differ** from the majority of observations in a dataset.

---

### 🔹 Examples

* Most salaries: 30k–80k, one salary: 5 million
* Test scores mostly 40–90, one score: 2

---

### 🔹 Causes of Outliers

* Data entry errors (extra zero, wrong unit)
* Measurement or sensor errors
* Natural extreme values
* Sampling errors

---

## 5.1 Types of Outliers

### 1️⃣ Global Outliers

* Extremely different from the entire dataset

---

### 2️⃣ Contextual Outliers

* Unusual only in a specific context
  Example: High temperature in winter

---

### 3️⃣ Collective Outliers

* A group of values that deviate together
  Example: Sudden spike in network traffic

---

## 5.2 Problems Caused by Outliers

* Skewed mean and variance
* Misleading EDA results
* Reduced model accuracy
* Poor visualization scaling

---

## 5.3 Outlier Detection Methods

### 1️⃣ Statistical Methods

* **Z-score method**
* **IQR (Interquartile Range)**
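
Both statistical methods can be sketched in a few lines. The cutoffs used here, |z| > 3 and 1.5 × IQR, are common conventions rather than fixed rules, and the salary data is made up.

```python
import pandas as pd

# Hypothetical salaries in thousands: 19 typical values and one extreme.
salaries = pd.Series(list(range(31, 50)) + [500])

# Z-score: distance from the mean measured in standard deviations.
z_scores = (salaries - salaries.mean()) / salaries.std()
z_outliers = salaries[z_scores.abs() > 3]

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = salaries[(salaries < lower) | (salaries > upper)]

print(z_outliers.tolist())
print(iqr_outliers.tolist())
```

Because an extreme value inflates the mean and standard deviation it is measured against, the z-score method can miss outliers in very small samples; the IQR rule is based on quartiles and is less affected.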

---

### 2️⃣ Visualization Methods

* Box plots
* Scatter plots
* Histograms

---

### 3️⃣ Model-Based Methods

* Isolation Forest
* DBSCAN

---

## 5.4 Handling Outliers

### 1️⃣ Remove Outliers

* When caused by error
* When they distort analysis

---

### 2️⃣ Cap / Winsorization

* Replace extreme values with upper/lower limits
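
A minimal capping sketch with `Series.clip`. The limits here are the 1.5 × IQR fences, which is one possible choice; fixed percentiles (for example the 5th and 95th) are another common option. The data is made up.

```python
import pandas as pd

values = pd.Series(list(range(31, 50)) + [500])

# Compute the capping limits from the quartiles (an assumed convention).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are replaced by the nearest limit.
capped = values.clip(lower=lower, upper=upper)

print(values.max(), capped.max())
```

Unlike removal, capping keeps the row count unchanged, which matters when other columns in the same row are still valid.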

---

### 3️⃣ Transform Data

* Log, square root transformations
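
A short sketch of both transformations with NumPy on made-up income values. Adding 1 before taking the logarithm is an assumption made here so that zero values stay defined.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0, 9, 99, 999, 9999])

# Log transform: compresses large values far more than small ones.
log_incomes = np.log10(incomes + 1)

# Square root transform: a milder compression, defined for values >= 0.
sqrt_incomes = np.sqrt(incomes)

print(log_incomes.round(2).tolist())
print(sqrt_incomes.round(2).tolist())
```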

---

### 4️⃣ Keep Outliers

* If they are meaningful (e.g., fraud detection)

---

## 5.5 Outliers in Data Cleaning vs EDA

* **EDA:** Identify and visualize outliers
* **Data Cleaning:** Decide treatment

---

## 5.6 Best Practices

* Always investigate before removing
* Use domain knowledge
* Avoid blind deletion
* Document decisions

---