# 4. Handling Inconsistent Data

Inconsistent data is a common issue in real-world datasets. It typically arises from manual data entry errors, differing conventions between people or systems, or merging datasets from multiple sources. Resolving these inconsistencies matters because values that should match are otherwise treated as distinct, which distorts counts, groupings, and joins.

## Identifying Inconsistencies
Before cleaning, inconsistencies need to be identified. Common examples include:

1. **Duplicate Rows**:
   - Repeated entries that unnecessarily increase dataset size and can skew analysis.

2. **Inconsistent Capitalization**:
   - Variations in case (e.g., 'USA', 'usa', 'UsA').

3. **Extra Spaces**:
   - Leading or trailing spaces in text fields.

4. **Mixed Formats**:
   - Different representations of the same data, such as dates (e.g., '01/15/2024' vs '2024-01-15') or currencies (e.g., '$1000' vs '1,000 USD').
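
The checks above can be sketched in a few lines of pandas. This is a minimal illustration rather than a complete audit: the `Country` and `Amount` columns and their values are hypothetical, chosen only so that each kind of inconsistency appears once.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the checks.
df = pd.DataFrame({
    "Country": ["USA", "usa", " USA", "Canada", "USA"],
    "Amount": [10, 20, 30, 40, 10],
})

# 1. Duplicate rows: duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())

# 2. Extra spaces: values that change when stripped contain padding.
has_padding = df["Country"] != df["Country"].str.strip()

# 3. Inconsistent capitalization: compare distinct counts before and
#    after normalizing whitespace and case.
raw_unique = df["Country"].nunique()
normalized_unique = df["Country"].str.strip().str.lower().nunique()

print(f"Duplicate rows: {n_duplicates}")
print(f"Values with padding: {int(has_padding.sum())}")
print(f"Distinct countries: {raw_unique} raw, {normalized_unique} normalized")
```

A gap between the raw and normalized distinct counts is a quick signal that a text column needs cleaning before it is used for grouping or joining.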

## Cleaning Techniques

### 1. Removing Duplicates
Duplicate rows are redundant and can inflate counts, sums, and averages, leading to incorrect analyses.

#### Method: `drop_duplicates()`
- Removes duplicate rows from a DataFrame.
- Parameters:
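  - **`keep`**: Specifies which duplicate to retain: `'first'` (the default), `'last'`, or `False` to remove every row that has a duplicate.
  - **`subset`**: Restricts the comparison to specific columns. By default, all columns are compared.
  - **`inplace`**: If `True`, modifies the original DataFrame and returns `None` instead of a new DataFrame.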



In [None]:
import pandas as pd

# Example DataFrame with duplicates
data = {"Name": ['Alice', 'Bob', 'Alice'], "Age": [25, 30, 25], "City": ['New York', 'Los Angeles', 'New York']}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Remove duplicate rows
df_cleaned = df.drop_duplicates(keep='first')
print('DataFrame after removing duplicates:')
print(df_cleaned)
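
The `subset` and `keep` parameters described above matter when rows repeat on only some columns. The following sketch uses hypothetical data in which the same person appears twice with a different city, so a full-row comparison finds no duplicates.

```python
import pandas as pd

# Hypothetical data: the same person appears twice with a different city.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Carol"],
    "Age": [25, 30, 25, 41],
    "City": ["New York", "Los Angeles", "Boston", "Chicago"],
})

# No two rows match on every column, so the default call removes nothing.
full_row = df.drop_duplicates()

# Compare only Name and Age, keeping the last occurrence of each pair.
keep_last = df.drop_duplicates(subset=["Name", "Age"], keep="last")

# keep=False drops every row that has a duplicate, leaving only unique pairs.
drop_all = df.drop_duplicates(subset=["Name", "Age"], keep=False)

print(len(full_row), len(keep_last), len(drop_all))
print(keep_last)
```

Which occurrence to keep depends on the data: `keep='last'` suits tables where later rows are more recent, while `keep=False` is useful for isolating records that need manual review.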

### 2. Text Normalization

Text fields often contain inconsistencies such as varying capitalization, extra spaces, or misspellings. Normalizing text ensures uniformity.

#### Methods:
1. **`str.lower()`** / **`str.title()`**: Convert text to lowercase or title case.
2. **`str.strip()`**: Removes leading and trailing spaces.
3. **`str.replace()`**: Replaces substrings with specified values.

#### Example:


In [None]:
# Example DataFrame with inconsistent text
data = {"Name": [' Alice ', 'BOB', 'Charlie '], "City": ['new york', 'Los Angeles ', 'CHICAGO']}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Normalize text
df['Name'] = df['Name'].str.lower().str.strip()
df['City'] = df['City'].str.title().str.strip()
print('DataFrame after text normalization:')
print(df)
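
The example above does not use `str.replace()`. The sketch below shows one way it might be combined with the other methods: a regular expression collapses repeated internal spaces, and `Series.replace()` (a whole-value replacement, distinct from `str.replace()`) maps a known variant to a canonical spelling. The city values and the `canonical` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical city values with padding, repeated spaces, and an abbreviation.
df = pd.DataFrame({"City": ["  new   york", "NYC", "New York ", "los  angeles"]})

city = (
    df["City"]
    .str.strip()                           # drop leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)  # collapse repeated internal spaces
    .str.title()                           # consistent capitalization
)

# Map known variants to one canonical spelling (illustrative mapping only).
# The key is "Nyc" because str.title() has already been applied.
canonical = {"Nyc": "New York"}
df["City"] = city.replace(canonical)

print(df["City"].tolist())
```

Applying the mapping after case normalization keeps the dictionary small, since each variant only needs to be listed in one capitalization.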

### 3. Standardizing Formats

Different formats for dates, currencies, or other fields can create inconsistencies.

#### Techniques:
1. **Dates**:
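   - Use `pd.to_datetime()` to convert strings to datetime objects.
   - When all values share one layout, pass it explicitly with the `format` argument (e.g., `format='%Y-%m-%d'`) so that non-matching values are caught rather than guessed.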

2. **Currencies**:
   - Remove symbols or commas using `str.replace()`.
   - Convert cleaned text to numeric using `pd.to_numeric()`.

#### Example:


In [None]:
# Example DataFrame with inconsistent formats
data = {"Date": ['01/15/2024', '2024-01-16', '15-Jan-2024'], "Price": ['$1,000', '$2,500', '$3,000']}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

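# Standardize date format. The column mixes several layouts, so parse each
# value individually with format='mixed' (requires pandas 2.0 or later);
# values that still cannot be parsed become NaT.
df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce')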
# Clean and convert price column
df['Price'] = df['Price'].str.replace('[$,]', '', regex=True).astype(float)
print('DataFrame after standardizing formats:')
print(df)
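
The techniques list mentions `pd.to_numeric()` and explicit date formats, which the example above does not exercise. The following sketch, on hypothetical data containing one unparseable price and one date in the wrong layout, shows how `errors='coerce'` turns bad values into `NaN` or `NaT` so they can be reviewed instead of stopping the conversion.

```python
import pandas as pd

# Hypothetical data: one malformed price and one date that breaks the format.
df = pd.DataFrame({
    "Date": ["2024-01-15", "2024-01-16", "16/01/2024"],
    "Price": ["$1,000", "2,500 USD", "n/a"],
})

# Keep digits and the decimal point only, then convert; values that are
# still not numeric become NaN instead of raising an error.
digits = df["Price"].str.replace(r"[^0-9.]", "", regex=True)
df["Price"] = pd.to_numeric(digits, errors="coerce")

# An explicit format makes parsing strict: non-matching dates become NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

print(df)
print("Rows needing review:", int(df.isna().any(axis=1).sum()))
```

Coercing and then counting missing values makes the cost of the cleanup visible, which is usually safer than silently dropping rows that failed to convert.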

## Summary

- **Identify inconsistencies** such as duplicate rows, varying capitalization, and mixed formats.
- Use **`drop_duplicates()`** to remove redundant rows.
- Normalize text with string methods like **`str.lower()`**, **`str.strip()`**, and **`str.replace()`**.
- Standardize formats for dates and numeric fields to ensure consistency.

Cleaning inconsistent data is a critical step in preprocessing, leading to reliable and accurate analyses.