## Identifying and Correcting Invalid Data
### Strategies for Identifying Invalid Data
### 1. Initial Exploratory Analysis

In [1]:
!pip install pandera

In [2]:
import pandas as pd
import pandera as pa

# Loading data
df = pd.read_csv('data.csv')

# Basic statistics
print(df.describe())

# Checking for missing values
print(df.isnull().sum())

# Checking data types
print(df.dtypes)


          idade      salario
count  50.00000    50.000000
mean   40.40000  2477.941604
std    17.88398  1492.588455
min    -1.00000  -500.000000
25%    35.25000  1570.100894
50%    45.50000  2309.607819
75%    50.75000  3687.221923
max    64.00000  4927.363553
idade      0
salario    0
email      0
dtype: int64
idade        int64
salario    float64
email       object
dtype: object
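
The summary above already hints at invalid records: the minimum of `idade` is -1 and the minimum of `salario` is -500. A minimal sketch of how such rows could be isolated with boolean masks follows. The sample frame and the `rules` dictionary are illustrative assumptions, not the contents of `data.csv`.

```python
import pandas as pd

# Hypothetical sample that mimics the columns of data.csv
sample = pd.DataFrame({
    'idade': [34, -1, 52, 47],
    'salario': [2300.0, 1800.0, -500.0, 4100.0],
    'email': ['a@example.com', 'b@example.com', 'c@example.com', 'd@example'],
})

# One boolean mask per rule; True marks a row that violates the rule
rules = {
    'negative_idade': sample['idade'] < 0,
    'non_positive_salario': sample['salario'] <= 0,
    'malformed_email': ~sample['email'].str.match(r'^[^@]+@[^@]+\.[^@]+$', na=False),
}

violations = pd.DataFrame(rules)
violation_counts = violations.sum()             # offending rows per rule
invalid_rows = sample[violations.any(axis=1)]   # rows breaking at least one rule

print(violation_counts)
print(invalid_rows)
```

Keeping one mask per rule makes it easy to report which rule a row breaks before deciding how to correct it.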


### 2. Validation with Pandera

In [3]:
# Defining validation schema
schema = pa.DataFrameSchema({
    'age': pa.Column(int, checks=pa.Check.ge(0)),
    'salary': pa.Column(float, checks=pa.Check.gt(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'))
})

# Validating and catching errors
try:
    schema.validate(df)
except pa.errors.SchemaError as e:
    print(f"Errors found: {e}")


Errors found: column 'age' not in dataframe. Columns in dataframe: ['idade', 'salario', 'email']


Note: this cell also emits a pandera deprecation warning. Importing pandas validation from the top-level pandera module will be **removed in a future version of pandera**. The recommended import is:

```
# old import
import pandera as pa

# new import
import pandera.pandas as pa
```

For other supported libraries such as pyspark or polars, see the supported libraries section of the documentation:

https://pandera.readthedocs.io/en/stable/supported_libraries.html



### Data Correction Techniques
### 1. Handling Null Values

In [4]:
# Strategies for handling missing values
def handle_missing_values(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Numeric column: fill with the mean of the observed values
    df['age'] = df['age'].fillna(df['age'].mean())

    # Categorical column: fill with the most frequent value
    df['department'] = df['department'].fillna(df['department'].mode()[0])

    # Fill with a fixed default value
    df['status'] = df['status'].fillna('pending')

    return df
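
As a usage sketch, the same three strategies can be applied while counting how many cells were filled in each column. The `fill_and_report` helper and the sample frame are hypothetical, and the median is swapped in for the mean here because it is less sensitive to outliers.

```python
import numpy as np
import pandas as pd

def fill_and_report(df, fillers):
    """Fill missing values per column; return the filled copy and fill counts."""
    out = df.copy()                 # leave the caller's DataFrame untouched
    filled = {}
    for column, filler in fillers.items():
        if column not in out.columns:
            continue                # skip columns this dataset does not have
        missing = out[column].isna()
        value = filler(out[column]) if callable(filler) else filler
        out[column] = out[column].fillna(value)
        filled[column] = int(missing.sum())
    return out, filled

sample = pd.DataFrame({
    'age': [25.0, np.nan, 35.0, 40.0],
    'department': ['IT', 'IT', None, 'HR'],
    'status': [None, 'active', 'active', None],
})

fillers = {
    'age': lambda s: s.median(),          # numeric: median of observed values
    'department': lambda s: s.mode()[0],  # categorical: most frequent value
    'status': 'pending',                  # fixed default
}

filled_df, report = fill_and_report(sample, fillers)
print(report)
print(filled_df)
```

Returning the counts alongside the data gives a simple record of what the imputation step changed.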


### 2. Data Type Correction

In [5]:
def fix_types(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Convert strings to datetime; unparseable values become NaT
    df['birth_date'] = pd.to_datetime(df['birth_date'], errors='coerce')

    # Convert strings to numeric; unparseable values become NaN
    df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

    # Convert to the category dtype
    df['department'] = df['department'].astype('category')

    return df
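
Because `errors='coerce'` silently turns unparseable entries into `NaN` or `NaT`, it can help to count how many values were lost in the conversion. A minimal sketch, using a hypothetical `coerce_numeric` helper and made-up sample values:

```python
import pandas as pd

def coerce_numeric(series):
    """Convert to numeric and count the non-null values that failed to parse."""
    converted = pd.to_numeric(series, errors='coerce')
    # Newly introduced NaN: null after conversion but not null before
    lost = int((converted.isna() & series.notna()).sum())
    return converted, lost

raw_salary = pd.Series(['5000', '6,000', 'n/a', None, '7500.50'])
salary, lost = coerce_numeric(raw_salary)

print(salary.tolist())
print(f"Values that could not be parsed: {lost}")
```

A non-zero count is a signal to inspect the raw values (for example, thousands separators) before accepting the coerced column.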


### 3. Correcting Invalid Values

In [6]:
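def fix_values(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Replace negative ages with the mean of the valid (non-negative) ages,
    # so the invalid values do not distort the replacement
    negative_age = df['age'] < 0
    df['age'] = df['age'].mask(negative_age, df.loc[~negative_age, 'age'].mean())

    # Clear emails that do not match the pattern used by the schema
    valid_email = df['email'].str.match(r'^[^@]+@[^@]+\.[^@]+$', na=False)
    df.loc[~valid_email, 'email'] = None

    # Cap extreme salaries at 1,000,000
    df.loc[df['salary'] > 1_000_000, 'salary'] = 1_000_000

    return df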
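
The same corrections can also be written with `Series.where` and `Series.clip`. The sketch below runs them on a hypothetical three-row sample. The 1,000,000 cap comes from the function above; everything else is illustrative.

```python
import pandas as pd

EMAIL_PATTERN = r'^[^@]+@[^@]+\.[^@]+$'

sample = pd.DataFrame({
    'age': [25.0, -30.0, 35.0],
    'email': ['john@email.com', 'maria@email', None],
    'salary': [5000.0, 2_500_000.0, 7000.0],
})

fixed = sample.copy()

# Keep valid ages; replace the rest with the mean of the valid ages only
valid_age = fixed['age'] >= 0
fixed['age'] = fixed['age'].where(valid_age, fixed.loc[valid_age, 'age'].mean())

# Keep emails that match the pattern; everything else becomes missing
valid_email = fixed['email'].str.match(EMAIL_PATTERN, na=False)
fixed['email'] = fixed['email'].where(valid_email)

# Cap salaries at the upper limit
fixed['salary'] = fixed['salary'].clip(upper=1_000_000)

print(fixed)
```

Computing the replacement mean from valid rows only keeps the invalid value (-30) from pulling the imputed age down.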


### Validation and Correction Pipeline

In [7]:
def validation_correction_pipeline(df):
    # 1. Initial analysis
    print("Initial analysis:")
    print(df.info())
    
    # 2. Handling missing values
    df = handle_missing_values(df)
    
    # 3. Fixing data types
    df = fix_types(df)
    
    # 4. Fixing invalid values
    df = fix_values(df)
    
    # 5. Final validation
    try:
        schema.validate(df)
        print("Data successfully validated!")
    except pa.errors.SchemaError as e:
        print(f"There are still errors: {e}")
    
    return df


### Complete Practical Example

In [8]:
import pandas as pd
import pandera as pa
import numpy as np

# Creating example data with issues
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['John', 'Maria', 'Peter', None, 'Ana'],
    'age': [25, -30, 35, 40, np.nan],
    'salary': ['5000', '6000', '7000', '8000', '9000'],
    'email': ['john@email.com', 'maria@email', 'peter@email.com', None, 'ana@email.com'],
    'department': ['IT', 'HR', 'Sales', 'IT', None]
}

df = pd.DataFrame(data)

# Defining validation schema
schema = pa.DataFrameSchema({
    'id': pa.Column(int, checks=pa.Check.gt(0)),
    'name': pa.Column(str, nullable=False),
    'age': pa.Column(int, checks=pa.Check.ge(0)),
    'salary': pa.Column(float, checks=pa.Check.gt(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'), nullable=True),
    'department': pa.Column(str, checks=pa.Check.isin(['IT', 'HR', 'Sales', 'Finance']), nullable=True)
})

# Applying correction pipeline
try:
    df_corrected = validation_correction_pipeline(df)

    # Checking results
    print("\nData after correction:")
    print(df_corrected)

except KeyError as e:
    print(f"Key error: {e}")
    print("Verify that all required columns are present in the DataFrame")


Initial analysis:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          5 non-null      int64  
 1   name        4 non-null      object 
 2   age         4 non-null      float64
 3   salary      5 non-null      object 
 4   email       4 non-null      object 
 5   department  4 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 372.0+ bytes
None
Key error: 'status'
Verify that all required columns are present in the DataFrame
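
The run stops with `KeyError: 'status'` because `handle_missing_values` assumes a `status` column that the example data does not have; `fix_types` would fail the same way on `birth_date`. One possible remedy, sketched here with a hypothetical `apply_if_present` helper, is to apply each column-level fix only when the column exists and to record which ones were skipped.

```python
import numpy as np
import pandas as pd

def apply_if_present(df, column, func, skipped):
    """Apply func to df[column] if the column exists; otherwise note the skip."""
    if column in df.columns:
        df[column] = func(df[column])
    else:
        skipped.append(column)
    return df

# Subset of the example data above
example = pd.DataFrame({
    'age': [25, -30, 35, 40, np.nan],
    'salary': ['5000', '6000', '7000', '8000', '9000'],
    'department': ['IT', 'HR', 'Sales', 'IT', None],
})

steps = [
    ('age', lambda s: s.fillna(s.mean())),
    ('department', lambda s: s.fillna(s.mode()[0])),
    ('status', lambda s: s.fillna('pending')),
    ('birth_date', lambda s: pd.to_datetime(s, errors='coerce')),
    ('salary', lambda s: pd.to_numeric(s, errors='coerce')),
]

skipped = []
cleaned = example.copy()   # keep the original data intact
for column, func in steps:
    cleaned = apply_if_present(cleaned, column, func, skipped)

print(f"Skipped missing columns: {skipped}")
print(cleaned.dtypes)
```

Whether a missing column should be skipped or treated as a hard error depends on the dataset; the list of skipped columns at least makes the decision visible.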


### Best Practices for Data Correction

1. **Document Changes:** Keep a record of all corrections performed.
2. **Preserve Original Data:** Always work with copies of the data.
3. **Validate After Each Step:** Ensure that corrections do not introduce new issues.
4. **Use Quality Metrics:** Define and monitor data quality metrics.
5. **Automate the Process:** Create reproducible pipelines for validation and correction.
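
Practices 1 and 2 can be combined: if corrections are applied to a copy, the original remains available for building a change log. A minimal sketch, assuming both frames share the same index and columns; `build_change_log` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def build_change_log(original, corrected):
    """Return one row per changed cell: index, column, old value, new value."""
    records = []
    for column in original.columns:
        old, new = original[column], corrected[column]
        # A cell changed if the values differ, ignoring cells missing in both
        changed = (old != new) & ~(old.isna() & new.isna())
        for idx in changed[changed].index:
            records.append({'index': idx, 'column': column,
                            'old': old.loc[idx], 'new': new.loc[idx]})
    return pd.DataFrame(records, columns=['index', 'column', 'old', 'new'])

original = pd.DataFrame({
    'age': [25.0, -30.0, np.nan],
    'department': ['IT', None, 'HR'],
})

corrected = original.copy()          # corrections never touch the original
corrected.loc[1, 'age'] = 30.0
corrected.loc[2, 'age'] = 30.0
corrected.loc[1, 'department'] = 'IT'

change_log = build_change_log(original, corrected)
print(change_log)
```

The resulting table can be saved alongside the corrected data as a record of what was changed.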

---

### Handling Special Cases

1. **Sensitive Data:** Be careful when correcting personal or sensitive information.
2. **Temporal Data:** Consider seasonality and temporal trends.
3. **Categorical Data:** Maintain consistency across categories.
4. **Relational Data:** Preserve referential integrity across tables.
5. **Batch Data:** Implement efficient strategies for large-scale data processing.
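
For relational data, one simple integrity check is to look for foreign keys that have no matching parent row. The sketch below uses hypothetical `employees` and `departments` tables and pandas' `merge` with `indicator=True`.

```python
import pandas as pd

# Hypothetical parent and child tables
departments = pd.DataFrame({
    'department_id': [1, 2, 3],
    'name': ['IT', 'HR', 'Sales'],
})
employees = pd.DataFrame({
    'employee_id': [10, 11, 12, 13],
    'department_id': [1, 2, 4, 1],
})

# A left join keeps every employee; _merge shows whether a department matched
joined = employees.merge(departments, on='department_id',
                         how='left', indicator=True)
orphans = joined.loc[joined['_merge'] == 'left_only', 'employee_id'].tolist()

print(f"Employees referencing a missing department: {orphans}")
```

Running such a check before and after a correction step helps confirm that the correction did not break links between tables.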
